Core-Only System Management Interrupt

ABSTRACT

An apparatus, including: a deterministic monitored device; an interconnect to communicatively couple the monitored device to a support circuit; a super queue to queue transactions between the monitored device and the support circuit, the super queue including an operational segment and a shadow segment; a debug data structure; and a system management agent to monitor transactions in the operational segment, log corresponding transaction identifiers in the shadow segment, and write debug data to the debug data structure, wherein the debug data are at least partly based on the corresponding transaction identifiers.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation (and claims the benefit of priority under 35 U.S.C. § 120) of U.S. patent application Ser. No. 15/640,532 filed on Jul. 1, 2017, entitled “Core-Only System Management Interrupt”. The disclosure of the prior application is considered part of and is hereby incorporated by reference in its entirety in the disclosure of this application.

FIELD OF THE SPECIFICATION

This disclosure relates in general to the field of semiconductor devices, and more particularly, though not exclusively, to a system and method for core-only periodic system management interrupt (PSMI).

BACKGROUND

Multiprocessor systems are becoming more and more common. In the modern world, compute resources play an ever more integrated role with human lives. As computers become increasingly ubiquitous, controlling everything from power grids to large industrial machines to personal computers to light bulbs, the demand for ever more capable processors increases.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is best understood from the following detailed description when read with the accompanying figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale, and are used for illustration purposes only. Where a scale is shown, explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

FIG. 1 is a block diagram illustrating selected elements of a debugging architecture, according to one or more examples of the present specification.

FIG. 2 is a block diagram of a system-on-a-chip, according to one or more examples of the present specification.

FIG. 3 is a block diagram illustrating transactions between cores and support circuitry in a system, such as a system-on-a-chip, according to one or more examples of the present specification.

FIG. 4 is a block diagram of a system having one or more cores under test according to one or more examples of the present specification.

FIGS. 5-7 include block diagrams illustrating one or more methods of logging debug data according to one or more examples of the present specification.

FIGS. 8a-8b are block diagrams illustrating a generic vector-friendly instruction format and instruction templates thereof according to one or more examples of the present specification.

FIGS. 9a-9d are block diagrams illustrating an example specific vector-friendly instruction format according to one or more examples of the present specification.

FIG. 10 is a block diagram of a register architecture according to one or more examples of the present specification.

FIG. 11a is a block diagram illustrating both an example in-order pipeline and an example register renaming an out-of-order issue/execution pipeline according to one or more examples of the present specification.

FIG. 11b is a block diagram illustrating both an example of an in-order architecture core and an example register renaming an out-of-order issue/execution architecture core to be included in a processor according to one or more examples of the present specification.

FIGS. 12a-12b illustrate a block diagram of a more specific in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip according to one or more examples of the present specification.

FIG. 13 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to one or more examples of the present specification.

FIGS. 14-17 are block diagrams of computer architectures according to one or more examples of the present specification.

FIG. 18 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to one or more examples of the present specification.

EMBODIMENTS OF THE DISCLOSURE

The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.

Emulation is a feature that vendors of integrated circuits, such as microprocessors and systems-on-a-chip, may use to improve upon and debug their silicon designs. For example, a system may be started from a known state, such as a reset state, or some other known state, and transactions may be logged, including timestamps that tie the transactions to a particular clock cycle, so that the processor state can be reconstructed in an emulator later. Logged transactions may be stored in a debugging data structure, which is configured to hold the information necessary for reconstructing the processor state in later emulation. If the processor encounters an error during execution, then the debug information may be flushed to a disk, and the debug data structure may then be used to “replay” the processor sequence in an emulator. Ideally, the emulator faithfully emulates the processor function and stays synchronized, and will encounter the same error at the same clock count.

If the emulator works correctly, then system designers can use the emulator to debug the system and perform a root cause analysis on the silicon design.

The Intel® periodic system management interrupt (PSMI) has long been a core component of the Intel® debug architecture, while other microprocessor manufacturers provide similar system management interrupt (SMI) features. Thus, it should be noted that throughout this specification, PSMI is used as a representative of a class of SMI features that may be provided in many different platforms that enable them to operate in a system management mode (SMM) or similar mode, and should be understood as a nonlimiting example.

A system agent, such as the system agent of FIG. 13, may act as a system management agent for PSMI purposes. PSMI provides very useful debugging and root cause analysis features. Debugging at the system level can be difficult because observability may be limited to the external bus, which can be far away in time (in terms of processor cycles) from the actual failure and its root cause. Thus, an emulator may be used to attempt to replay the whole scenario on a clock-by-clock basis to attempt to reproduce the system failure. If failure is successfully reproduced in the emulator, then the designer can easily perform a root cause analysis. Within the emulator, the designer has insight into internal signaling and debug hooks. On the other hand, if the whole scenario is replayed and the failure does not manifest, then the designer may conclude that the issue is with a circuit or speed path, rather than a logic bug. In either case, PSMI helps to perform the root cause analysis.

PSMI collects all inputs to the PSMI boundary within the CPU from a microarchitectural known state up to the failure. This information may be stored in a debug data structure, which can be used in the replay on the emulator.

To faithfully reconstruct the processor state, debug data are collected so that the scenario can be replayed in a deterministic manner on an accurate clock-by-clock basis. However, as computing systems have become more complex, and CPU development has evolved, the deterministic domain of the PSMI was initially narrowed from the socket level to the system agent level. Thus, collecting deterministic data became more difficult. As processors have evolved, some processor applications have become so complex that it is impractical to provide a deterministic replay, and in those cases PSMI may be abandoned altogether. In other examples, large investments may need to be made to ensure that the support circuitry (e.g., Intel® uncore) portion of a system-on-a-chip remains deterministic so that the processor state can be faithfully replayed. However, this can be time-consuming and expensive.

In particular, modern design of a system-on-a-chip is largely moving toward an intellectual property (IP) block-based model. In the IP block model, certain circuits or subcircuits, including all or most of the support circuitry design in many cases, are embodied in preconfigured IP blocks that provide near “black box” functionality. These IP blocks can be independently developed and maintained, and have well-defined, modular inputs and outputs. This can significantly reduce cost and effort in designing an integrated circuit, and can lead to better reuse of certain resources.

However, the use of IP blocks presents additional challenges for PSMI. Because an SoC may be based on a number of nondeterministic components that no longer have tightly coupled timing and clocking correlations, the system as a whole loses its deterministic aspect. Furthermore, some IP blocks on an SoC may even be imported from third-party vendors, so that there is no visibility into the internal workings, and no practical way to make those IP blocks deterministic.

Thus, a modification to legacy PSMI architectures can be provided wherein a certain portion of an integrated circuit may be treated as deterministic, and the rest of the circuit or SoC may be treated as nondeterministic. The deterministic portion may be the processor core, or may be some other IP block such as an integrated graphics processor, or any other logical block that may need to be emulated and debugged. As long as the monitored device remains deterministic, it is not necessary for the rest of the system to be deterministic. Rather, transactions are logged at a virtual boundary defined between the deterministic and nondeterministic portions of the circuit. Logging of outbound transactions may be relatively less important, because those transactions do not generally provide a substantive effect on the processor state. However, logging of inbound transactions may be relatively more important, as inbound transactions do affect the processor state.

Advantageously, the so-called “core only” PSMI of the present specification provides a clock-by-clock accurate and deterministic debug structure for the core or other monitored device, and also provides minimal intrusiveness or perturbations due to the operation of debug data that may be collected for later replay. Manifestly, if the collection of debug data itself becomes intrusive, that may itself become a source of errors.

In an embodiment of the present specification, the monitored device is treated as a hard IP block. Note that throughout this specification, a core is used in many illustrations and examples as a monitored device. This should be understood to be a nonlimiting and illustrative example of how the teachings of this specification can be applied to a specific monitored device. However, it should be understood that a practitioner in the art, exercising engineering skill, can apply the teachings to other devices, including other hard IP blocks, to provide some similar monitoring for those blocks. Thus, throughout this specification, where a core is spoken of as the monitored device, it should be understood to stand as a representative member of the entire class of devices that may be monitored according to the teachings herein.

In one example, the core virtual boundary, or the region of determinism, is defined at the in-die interface (IDI) bus. An IP collection apparatus may define a method to sample all IDI input traffic.

In an example, IP collection and replay principles may be based on two different methods for tracing the input to the boundary. First, inbound collection data may be based on a record in the so-called super queue that queues transactions between the core and the support circuitry. This queue is referred to as a “super queue” because it is a multi-level queue, including an operational portion and a shadow portion. Collection of outbound transactions, on the other hand, may be based on an IDI on-die logic analyzer trigger (ODLAT) that records asynchronous events, snoops, and power management events. As used in this specification, “asynchronous” events should be understood to include events that occur outside of the cycle of the IDI bus transactions.

The use of these inbound and outbound methods in parallel allows the handling of deadlock conditions, and a balanced design effort.

Collection of inbound transactions may be set out as IDI transactions. These may be treated the same as other functional traffic. In many cases, inbound transactions may handle heavy data loads delivered on a 64-byte cache line, which can reach a bandwidth of approximately 60 Gb per second on dual channel double data rate (DDR) 3 memory.

Collection of outbound traffic may be sent via a debug trace fabric (DTF), which may be more limited, such as in one example to approximately 6.4 GB per second at a clock speed of 800 MHz. However, collection of outbound transactions may be performed asynchronously.
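By way of nonlimiting illustration, the DTF figures quoted above can be reduced to a per-clock budget. The following sketch uses only the approximate numbers given in the text; treating the rate as a uniform per-clock payload, and comparing it against a 64-byte cache line, are assumptions made for illustration rather than properties of any particular fabric.

```python
# Approximate figures quoted in the text; the per-clock reading is an assumption.
DTF_BYTES_PER_SECOND = 6_400_000_000   # approximately 6.4 GB per second
DTF_CLOCK_HZ = 800_000_000             # 800 MHz

# Bytes the debug trace fabric could carry per clock at that rate.
dtf_bytes_per_clock = DTF_BYTES_PER_SECOND // DTF_CLOCK_HZ

CACHE_LINE_BYTES = 64
# Clocks the DTF would need to carry one cache line's worth of payload.
clocks_per_cache_line = CACHE_LINE_BYTES // dtf_bytes_per_clock

print(dtf_bytes_per_clock, clocks_per_cache_line)
```

Under these assumptions, the narrower fabric is better suited to compact event packets than to full data returns, which is consistent with routing the heavy traffic elsewhere.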

By limiting the collection of traffic to the IP perimeter, the teachings of the present specification allow collection of traffic in a deterministic fashion while limiting the volume of data collected. Furthermore, the heavy data transfer capability associated with the inbound traffic mitigates the intrusiveness of the core-only PSMI disclosed herein.

This provides advantages over certain legacy solutions that provide PSMI at the system agent level, but require the whole path between core and memory controller to be deterministic. Those legacy systems may be based on memory observability trace. Certain embodiments of this require full-chip simulation of core and support circuitry (e.g., uncore) to be clock-by-clock accurate. This requires both the core and the support circuitry to operate in a deterministic clock reference.

In embodiments of the present specification, the main source of traffic between the core and the support circuitry is the super queue (SQ). Thus, a core-only PSMI of the present specification may accurately sample all transactions that are input to the SQ. To accomplish this, an SQ may be provided that can provide a self-record. In one example, the SQ is divided into a functional segment and a “shadow” segment. The functional segment includes the queue of actual transactions that are to go out to the support circuitry. The shadow segment may include corresponding transactions that are logged for the purpose of providing replayability of the transactions. Note that the transaction logged in the shadow segment need not be the exact same operation as that which went out on the actual queue. For example, when a read is performed from the operational segment of the SQ, it may not be necessary or desirable to place the same read operation in the same position in the shadow segment. The read operation does not affect the internal state of the core, so it may be superfluous to a later simulation. Rather, when the read operation is issued from the SQ, a corresponding write operation may be placed in the shadow segment of the SQ, in anticipation that the read from memory will be followed by a write of the resulting data fetched from memory. Thus, once a read transaction is issued, the related entry in the shadow segment of the SQ may be marked as “write 1.” A companion write may be marked as “write 2” and allocated on the shadow segment of the SQ. At the end of the transaction flow, write 1 collects the 64 bytes of returned data in two pumps. Write 2 includes the timestamp of the data return and the accumulated responses with its timestamp.
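By way of nonlimiting illustration, the division of the SQ into a functional segment and a shadow segment may be modeled in software as follows. This is a minimal sketch only: the class names, field names, and the use of Python lists are hypothetical and are not drawn from any hardware implementation. The sketch shows an outbound read being queued without being logged, while a “write 1” entry and a companion “write 2” entry are allocated in the shadow segment to receive the returned data and its timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowEntry:
    kind: str                      # "write 1" or "write 2"
    txn_id: int
    data: bytes = b""
    timestamps: list = field(default_factory=list)


class SuperQueue:
    """Toy model of a queue with a functional segment and a shadow segment."""

    def __init__(self):
        self.operational = []      # transactions headed to the support circuitry
        self.shadow = []           # log entries kept only for later replay

    def issue_read(self, txn_id, address):
        # The outbound read is queued but not logged; an emulator can
        # regenerate it during replay.
        self.operational.append(("read", txn_id, address))
        # Allocate the anticipated inbound record instead.
        self.shadow.append(ShadowEntry("write 1", txn_id))
        self.shadow.append(ShadowEntry("write 2", txn_id))

    def data_return(self, txn_id, chunk, timestamp):
        write1, write2 = [e for e in self.shadow if e.txn_id == txn_id]
        write1.data += chunk                  # write 1 accumulates returned data
        write2.timestamps.append(timestamp)   # write 2 keeps the timestamps


sq = SuperQueue()
sq.issue_read(txn_id=7, address=0x1000)
sq.data_return(7, bytes(32), timestamp=100)   # first 32-byte pump
sq.data_return(7, bytes(32), timestamp=104)   # second 32-byte pump
```

In this model, the shadow segment ends up holding exactly the inbound information that affects the state of the core, which is the information a replay would need.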

Although outgoing write operations need not be systematically logged, and the data written by those operations need not be logged, an outgoing transaction such as a write may result in a snoop being logged. Snoops and asynchronous events may be captured by associated packets in the IDI ODLAT. The IDI ODLAT is a hardware module located on the IDI bus that monitors traffic of the IDI bus. When a traffic packet matches criteria for the IDI ODLAT, the IDI ODLAT issues a packet including the important attributes of the transactions and timestamps. This packet is sent outbound through a debug trace fabric (DTF). Then, a trace aggregator may issue it out through memory or an external port. IDI ODLAT may not necessarily be a special-purpose component of core-only PSMI. Rather, it may be a generic debug module that may be used for manual debug, triage between core to support circuitry failures, or other purposes. Some of the IDI ODLAT tracing capability may be reassigned to collect technology for core-only PSMI to realize the teachings of the present specification. Note that IDI ODLAT may be particularly used to capture transactions that are not related to the SQ, such as snoops, power management events, and asynchronous events. Asynchronous events may include in some examples capturing of credit occurrences.

The use of the IDI ODLAT in this context provides a less intrusive monitoring of outbound transactions, and the IDI ODLAT may even continue to work when the main flow is blocked, such as in the case of a lock scenario.

Thus, the teachings of the present specification realize an architecture in which heavy inbound traffic is traced through the IDI, such as where 64 bytes of data are returned on the cache line, in which case a corresponding inbound write may be recorded, along with the actual content returned from the cache. Lighter traffic may be traced outbound, or asynchronously, through the IDI ODLAT. This could include snoops, credit, advanced programmable interrupt controller (APIC), power management events, or other outbound events.

A system and method for core-only PSMI will now be described with more particular reference to the attached FIGURES. It should be noted that throughout the FIGURES, certain reference numerals may be repeated to indicate that a particular device or block is wholly or substantially consistent across the FIGURES. This is not, however, intended to imply any particular relationship between the various embodiments disclosed. In certain examples, a genus of elements may be referred to by a particular reference numeral (“widget 10”), while individual species or examples of the genus may be referred to by a hyphenated numeral (“first specific widget 10-1” and “second specific widget 10-2”).

FIG. 1 is a block diagram illustrating selected elements of a debugging architecture according to one or more examples of the present specification.

The debugging architecture of FIG. 1 includes a real system 104 and an emulated system 108. Real system 104 begins in a known state 112, for example a reset state. Note that a reset state is used as a nonlimiting example, and any known state may be used.

Progressing from known state 112, certain processing operations may happen in state 116. These processing operations change the state of the real system. Note that in some examples, real system 104 may be a system that includes both deterministic and nondeterministic components, including a monitored device such as a processing core, and other circuits such as support circuitry that may not be deterministic with respect to the emulation. While processing 116 occurs, a core-only PSMI may collect debugging data and store the data in a debug data structure.

At state 120, real system 104 may encounter an error.

Emulated system 108 receives the debug data structure, and starts at a simulated known state 124. Emulated system 108 then progresses through simulated processing 128. Assuming that the simulation is faithful, and that appropriate debugging data have been collected, then the replay of real system 104 results in reproduced error 132. Because reproduced error 132 occurs on an emulated system 108, the system designer may have greater visibility into the processor state, and the state of the logic, thus providing greater opportunity to perform debugging than may be available in real system 104.
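By way of nonlimiting illustration, the record-and-replay flow of FIG. 1 may be sketched with a toy model. Everything in the sketch is hypothetical: the XOR state update, the error condition, and the use of a seeded random generator to stand in for nondeterministic support circuitry are assumptions chosen only to make the idea concrete. The point illustrated is that if the monitored device is deterministic and every inbound value is logged with its cycle number, a replay from the same known state reaches the same error at the same cycle without needing the nondeterministic source.

```python
import random


def core_step(state, inbound_value):
    """Deterministic toy device: state changes only on inbound data."""
    return state ^ inbound_value if inbound_value is not None else state


def is_error(state):
    # Hypothetical failure condition, chosen only for illustration.
    return (state & 0xF) == 0xF


def run_real(seed, cycles):
    """Run the 'real system' and log inbound events as (cycle, value) pairs."""
    rng = random.Random(seed)   # stands in for nondeterministic support circuitry
    state, log = 0, []          # known state
    for cycle in range(cycles):
        value = rng.randrange(256) if rng.random() < 0.3 else None
        if value is not None:
            log.append((cycle, value))
        state = core_step(state, value)
        if is_error(state):
            return cycle, state, log
    return None, state, log


def replay(log, cycles):
    """Replay on the 'emulator' using only the logged inbound events."""
    events = dict(log)
    state = 0                   # same known state
    for cycle in range(cycles):
        state = core_step(state, events.get(cycle))
        if is_error(state):
            return cycle, state
    return None, state


error_cycle, final_state, debug_log = run_real(seed=1, cycles=200)
assert replay(debug_log, 200) == (error_cycle, final_state)
```

Note that the replay function never consults the random generator, which corresponds to the emulator not needing to model the nondeterministic portion of the system.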

FIG. 2 is a block diagram of a system-on-a-chip 200 according to one or more examples of the present specification. In the example of FIG. 2, SoC 200 is divided by a virtual boundary 228 between a deterministic domain 232 and a nondeterministic domain 236. Note that as used here and throughout this specification and the claims, deterministic domain 232 is deterministic in the sense that its logic state can be exactly reconstructed on an accurate clock-by-clock basis with appropriate debugging data. Nondeterministic domain 236 is nondeterministic in the sense that debugging data provided for deterministic domain 232 may not be capable of reconstructing the state of nondeterministic domain 236 on an exact clock-by-clock basis.

Deterministic domain 232 may include, by way of nonlimiting example, certain hard IP blocks such as cores 210, namely core 0 210-1, core 1 210-2, core 2 210-3, and core 3 210-4, and a built-in graphics module GT 214.

Nondeterministic domain 236 may include both hard IP and soft IP blocks that are communicatively coupled to the deterministic block 232 across virtual boundary 228 via, for example, a coherent fabric 270, which is coherent in the sense of cache coherency. Coherent fabric 270 may provide a half-ring architecture.

Nondeterministic domain 236 may include system management agent 216, which may be a system agent and which may provide the core-only PSMI of the present specification. Nondeterministic domain 236 may also include other features not shown, such as a memory controller, a power management unit, a common trace port, a primary fabric, and other support circuitry by way of nonlimiting example.

In an example, debug data structures may be collected for hard IP blocks in deterministic domain 232, such as cores 210 and/or GT 214. To provide successful replay, or in other words the ability to reproduce the collected debug data on an emulator, relevant events are collected deterministically and the clocking is logged accurately. In one example, a crystal clock may be provided that intentionally runs on a frequency that is not a pure multiplier of the reference clock. This may be done to reduce spread spectrum clocking. For example, the crystal clock may run at 19.2 MHz, while the reference clock may run at 100 MHz. However, for successful replay via core-only PSMI, a clock-by-clock deterministic replay must be performed. Thus, the legacy crystal clock may be replaced by a clock that is a pure multiplier of the reference clock. For example, if the reference clock is 100 MHz, the crystal clock may run at 25 MHz. However, the switch to the crystal clock that is a pure multiplier of the reference clock may be done only on the timestamp resources that collect the debug data structure. In the example of FIG. 2, the monitored devices include cores 210 and GT 214. The core may have two timestamp sources, namely the timestamp counter (TSC, or the core official timestamp located at APIC), and the IDI ODLAT timestamp located at the IDI ODLAT.

As illustrated in this figure, core timestamps may be switched to a pure multiplier such as 25 MHz in the collect and replay mode. However, the rest of SoC 200, such as devices and circuits within nondeterministic domain 236 may still run at the original crystal clock frequency, such as a non-multiplier like 19.2 MHz. The use of a deterministic clock only within deterministic domain 232, while leaving nondeterministic domain 236 alone, optimizes the probability of a successful replay. The timestamp during a collect and replay mode may be provided in the core clock, thus allowing the transactor on the emulator to successfully replay the scenario.
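By way of nonlimiting illustration, the clock relationship described above can be checked with exact arithmetic. The sketch below assumes that “pure multiplier” means the reference clock period count per crystal tick is a whole number; that reading, and the helper names, are assumptions made for illustration. It uses the frequencies given in the text.

```python
from fractions import Fraction

REFERENCE_HZ = 100_000_000   # 100 MHz reference clock


def reference_cycles_per_tick(crystal_hz):
    """Reference-clock cycles spanned by one crystal tick, as an exact ratio."""
    return Fraction(REFERENCE_HZ, crystal_hz)


def aligns_every_tick(ratio):
    """True when each crystal tick lands on a whole reference cycle."""
    return ratio.denominator == 1


legacy = reference_cycles_per_tick(19_200_000)        # 19.2 MHz crystal
replay_clock = reference_cycles_per_tick(25_000_000)  # 25 MHz crystal

print(legacy, aligns_every_tick(legacy))
print(replay_clock, aligns_every_tick(replay_clock))
```

With a 19.2 MHz crystal, each tick spans 125/24 reference cycles, so a timestamp cannot be mapped to a whole reference cycle on every tick. With a 25 MHz crystal, each tick spans exactly 4 reference cycles, which is what permits a timestamp to be tied to a specific clock during replay.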

In another embodiment, IDI ODLAT may be used to capture asynchronous events. In some embodiments, a device such as GT 214 may lack the IDI ODLAT of core 210, and thus may store asynchronous events in P6 static random-access memory (SRAM). This allows the capture of asynchronous events in addition to synchronous events.

FIG. 3 is a block diagram illustrating transactions between cores 302 and support circuitry 308 in a system, such as a system-on-a-chip as illustrated in FIG. 2. In this example, cores 302 may communicate via IDI 312 for large synchronous transactions, such as reads from a cache line. IDI 312 may include a core to support circuitry bus 328, and a support circuitry to core bus 324. IDI 312 may also include an ODLAT 320.

Other buses may be provided, and may track other transactions, including asynchronous transactions. These can include, for example, a serial events bus 330, which may be used to track asynchronous transactions, as well as a credit bus 334, which may be used to exchange credits in a timing scheme.

FIG. 4 is a block diagram of a system 400 having one or more cores under test 402 according to one or more examples of the present specification. System 400 includes a core under test 402, a main memory 404 (for example, a read only memory (ROM) or dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM)), static memory such as flash memory, static random access memory (SRAM), volatile high data rate RAM, or similar, and a secondary memory 418 (for example, a persistent storage device including a hard disk drive or a persistent fast memory such as Intel® 3-D Crosspoint), which communicate with each other via a bus 430. Main memory 404 includes a quiesce unit 424 to trigger and coordinate a quiesce phase within core under test 402. Core under test 402 operates in conjunction with processor 426 to perform methods disclosed herein. In one embodiment, core under test 402 utilizes a signal capture 425 internal to core under test 402 to capture input signals.

System 400 may also include a network interface card 408. System 400 also may include a user interface 410 (such as a display unit) and may provide a human input device 412, which may include for example a keyboard, an alphanumeric device, a mouse, or similar. System 400 may include a signal generation device 416, such as an integrated speaker. System 400 may also include a peripheral device 436 (for example, wireless or wired communication devices, memory devices, storage devices, audio processing devices, video processing devices, or similar).

Secondary memory 418 may include a non-transitory computer readable medium 431, on which may be stored one or more sets of instructions (for example, software 422), embodying one or more methods or functions as illustrated herein. Software 422 may also reside, completely or partially, within main memory 404 and/or within core under test 402 during execution by system 400. Software 422 may also be transmitted or received over a network 420 via network interface card 408.

FIGS. 5-7 include block diagrams illustrating one or more methods of logging debug data according to one or more examples of the present specification. In these figures, a core 504 includes a super queue (SQ) 540 as well as a snoop queue 544. Core 504 may exist on a system-on-a-chip with support circuitry 508. As illustrated herein, core 504 may be any deterministic domain for emulation purposes, while support circuitry 508 is any nondeterministic domain. Core 504 communicates with support circuitry 508 via IDI 512, which includes support circuitry to core 524 and core to support circuitry buses 528. IDI 512 also includes an IDI ODLAT 520. A system management agent operating on support circuitry 508 may collect debug data that enables replay of operations on core 504, and may store them on emulation store 530. Emulation store 530 may be configured to hold a debug data structure that can be used for replay on an emulator as illustrated herein.

Several operations are illustrated over the course of FIGS. 5, 6, and 7, and are illustrated in a particular order for purposes of illustration. The order of operations should be understood to be nonlimiting, and is disclosed only to aid in understanding of the operations.

As discussed herein, SQ 540 is a 32-element queue that is configured to queue transactions between core 504 and support circuitry 508. For example, in the illustrated embodiment, the current highest priority position within SQ 540 is occupied by a read operation represented by a first symbol, wherein the read operation may be, for example, a read from the cache line. This may result in a 64-byte data read, which may be written to register files within core 504.

At operation 1, core 504 issues the read operation to support circuitry 508 via CTU bus 528.

The read operation is an outbound operation that may not need to be preserved to ensure replayability of core 504.

In this example, SQ 540 is divided into two regions, namely a first operational region 544, and a second shadow region 548. Operational region 544 is the region that provides actual operations to support circuitry 508. Shadow region 548 includes corresponding log entries that need not provide a one-to-one correlation with operational region 544. The read request is not captured, since it is an output from the deterministic domain, and may be generated by the emulator during replay.

At operation 2, when core 504 issues a read, SQ 540 stores a corresponding transaction write-to, represented by a second symbol, in shadow segment 548 of SQ 540. Write-to is stored in anticipation that the read will result in a corresponding write-back to core 504.

Because IDI 512 is 32 bytes wide, the read from cache, which is 64 bytes, is returned in two stages. At operation 3, write 0 is returned via UTC bus 524.

At operation 4, the data returned are written to entry 0 in SQ 540.

At operation 5, a corresponding entry is written to shadow segment 548, including a timestamp D0 for the write 0.

At operation 6, write 1 is returned via UTC 524.

At operation 7, the returned data are stored in entry 0 of SQ 540, again being 32 bytes wide.

At operation 8, a timestamp D1 is also stored for write 1 in shadow segment 548. Finally, a response with an appropriate timestamp may also be stored in shadow segment 548, thus creating a complete transactional record of the write that resulted from the initial read. Note that the read itself may not need to be logged to re-create or replay the transaction. Rather, during the replay, the corresponding read may be generated internally to the emulator.

Turning to FIG. 6, at operation 9, the data returned are consumed by the original transaction. The read command has been transformed to a write, while the payload of the write is the same data that were previously returned by the read. Write 1 is written to a specially preallocated debug region in memory. Write 2 follows write 1 and is written to the same preallocated debug region, such as emulation store 530. Write 1 includes the timestamp of the data returned and the timestamp of the response and its type. In this example, write 1 may have an even address, and write 1 is written only to the even address for ease of decoding a valid trace. Note that when the transaction does not include data (such as a write) then there may be nonvolatiles in trace memory.

Write 2 may have an odd address. Write 2 may be written only to oddaddresses, and may also include a valid data bit indicating that itsassociated even address does indeed include valid data.
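As a purely illustrative sketch of how the even/odd convention may ease decoding, the following walks a trace in address pairs and keeps only those even-address records whose odd-address partner has its valid bit set. The record layout (dictionaries with `d0`, `d1`, `rsp`, `type`, and `valid` keys) and the function name `decode_trace` are assumptions made for this example and are not specified by the disclosure.

```python
def decode_trace(slots):
    """Pair even/odd slots; keep pairs whose odd slot marks the even slot valid."""
    records = []
    for even in range(0, len(slots) - 1, 2):
        metadata, marker = slots[even], slots[even + 1]
        if marker.get("valid", False):
            records.append({"address": even, **metadata})
    return records


# Hypothetical trace contents, indexed by address within the debug region.
trace = [
    {"d0": 100, "d1": 104, "rsp": 105, "type": "data"},  # even address 0
    {"valid": True},                                     # odd address 1
    {},                                                  # even address 2 (no data)
    {"valid": False},                                    # odd address 3
]
decoded = decode_trace(trace)
```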

FIG. 7 illustrates logging of snoops and asynchronous events via ODLAT 520.

Snoops and asynchronous events may be captured by an associated packet in IDI ODLAT 520. IDI ODLAT 520 may be a hardware module located on IDI 512 that monitors the traffic of IDI 512. IDI ODLAT may match appropriate traffic and when traffic is matched, IDI ODLAT issues a packet including the important attributes of the transaction and the timestamp.
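The match-and-emit behavior described above may be sketched in software as follows. This is an assumption-laden illustration, not a description of the hardware module: the function name `odlat_capture`, the tuple layout of the traffic, and the transaction kinds are all hypothetical.

```python
def odlat_capture(traffic, match_kinds):
    """Emit one packet (attributes plus timestamp) per matched transaction."""
    packets = []
    for timestamp, kind, attributes in traffic:
        if kind in match_kinds:
            packets.append({"timestamp": timestamp, "kind": kind, **attributes})
    return packets


# Hypothetical interconnect traffic: (timestamp, kind, attributes).
traffic = [
    (10, "read", {"address": 0x1000}),
    (12, "snoop", {"address": 0x2000}),
    (15, "apic", {"vector": 0x30}),
]
packets = odlat_capture(traffic, match_kinds={"snoop", "apic"})
```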

The packet may be sent outbound via a special debug bus called debug trace fabric (DTF) 570. Once DTF 570 issues the packet, a trace aggregator may issue it out through memory or an external port.

IDI ODLAT 520 may be used to capture transactions that are not related to SQ 540, such as snoops and APIC. Snoops may be stored on a separate snoop queue 544.

IDI ODLAT 520 may also be used for capturing asynchronous events such as credit occurrences, power management, and similar. Advantageously, IDI ODLAT is less intrusive to the main flow, and may work when the main flow is blocked, such as via a lock scenario.

By capturing heavy inbound transactions, as well as asynchronous transactions in snoops, an emulator may be able to use data in emulation store 530 to accurately reproduce the state of core 504.

Note that in the preceding figures, outbound transactions are not captured. However, in one embodiment, there may be some value in opportunistically or periodically capturing outbound transactions. In this case, opportunistic capturing may be capturing when there is spare capacity, or when the bus is not busy. Periodic capturing may be capturing on a regular schedule, or some combination of opportunistic and periodic capturing.

A value in opportunistically or periodically capturing outbound transactions is that these may be used to provide synchronization. For example, it may be desirable to periodically compare outbound transactions that are logged to outbound transactions being generated internally within the emulator. If the outbound transactions match, then the designer can have confidence that the simulation is accurately matching the real core 504. If the emulator gets out of sync with core 504, then the system designer may be able to determine that some error has occurred, or that there is some misconfiguration in the emulator itself or in the debug data structure. Thus, opportunistic or periodic capturing of outbound transactions may provide useful synchronization features.
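One way such a synchronization check might be expressed is sketched below. The sketch assumes that sampled outbound transactions are logged with a sequence number and that the emulator produces its outbound transactions as an ordered list; both conventions, the function name `first_divergence`, and the transaction strings are hypothetical.

```python
def first_divergence(logged_samples, emulated):
    """Return the first sampled sequence number where the emulator disagrees
    with the logged outbound transaction, or None if every sample matches."""
    for seq in sorted(logged_samples):
        if seq >= len(emulated) or emulated[seq] != logged_samples[seq]:
            return seq
    return None


# Outbound transactions regenerated by the emulator during replay.
emulated = ["read 0x1000", "write 0x2000", "read 0x3000", "read 0x4000"]

# Only some outbound transactions were captured (opportunistic or periodic).
in_sync = first_divergence({0: "read 0x1000", 2: "read 0x3000"}, emulated)
out_of_sync = first_divergence({0: "read 0x1000", 3: "read 0x4040"}, emulated)
```

A result of `None` indicates that every sampled transaction matched; an integer result identifies the earliest sampled point at which the emulator and the real core are known to disagree.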

Certain of the figures below detail example architectures and systems to implement embodiments of the above. In some embodiments, one or more hardware components and/or instructions described above are emulated as detailed below, or implemented as software modules.

In certain examples, instruction(s) may be embodied in a “generic vector-friendly instruction format,” which is detailed below. In other embodiments, another instruction format is used. The description below of the write mask registers, various data transformations (swizzle, broadcast, etc.), addressing, etc. is generally applicable to the description of the embodiments of the instruction(s) above. Additionally, example systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) above may be executed on those systems, architectures, and pipelines, but are not limited to those detailed.

An instruction set may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because there are fewer fields included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. In one embodiment, an example ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. A set of SIMD extensions referred to as the advanced vector extensions (AVXs) (AVX1 and AVX2), and using the vector extensions (VEX) coding scheme, has been released and/or published (e.g., see Intel® 64 and IA-32 Architectures Software Developer's Manual, September 2014; and see Intel® Advanced Vector Extensions Programming Reference, October 2014).

Example Instruction Formats

Embodiments of the instruction(s) described herein may be embodied in different formats. Additionally, example systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.

Generic Vector-Friendly Instruction Format

A vector-friendly instruction format is an instruction format that is suited for vector instructions (e.g., there are certain fields specific to vector operations). While embodiments are described in which both vector and scalar operations are supported through the vector-friendly instruction format, alternative embodiments use only vector operations through the vector-friendly instruction format.

FIGS. 8a-8b are block diagrams illustrating a generic vector-friendly instruction format and instruction templates thereof according to embodiments of the specification. FIG. 8a is a block diagram illustrating a generic vector-friendly instruction format and class A instruction templates thereof according to embodiments of the specification; while FIG. 8b is a block diagram illustrating the generic vector-friendly instruction format and class B instruction templates thereof according to embodiments of the specification. Specifically, class A and class B instruction templates are defined for a generic vector-friendly instruction format 800, both of which include no memory access 805 instruction templates and memory access 820 instruction templates. The term generic in the context of the vector-friendly instruction format refers to the instruction format not being tied to any specific instruction set.

Embodiments of the specification will be described in which the vector-friendly instruction format supports the following: a 64 byte vector operand length (or size) with 32 bit (4 byte) or 64 bit (8 byte) data element widths (or sizes) (and thus, a 64 byte vector consists of either 16 doubleword-size elements or alternatively, 8 quadword-size elements); a 64 byte vector operand length (or size) with 16 bit (2 byte) or 8 bit (1 byte) data element widths (or sizes); a 32 byte vector operand length (or size) with 32 bit (4 byte), 64 bit (8 byte), 16 bit (2 byte), or 8 bit (1 byte) data element widths (or sizes); and a 16 byte vector operand length (or size) with 32 bit (4 byte), 64 bit (8 byte), 16 bit (2 byte), or 8 bit (1 byte) data element widths (or sizes); alternative embodiments may support more, less and/or different vector operand sizes (e.g., 256 byte vector operands) with more, less, or different data element widths (e.g., 128 bit (16 byte) data element widths).
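The relationship between vector operand length, data element width, and element count stated above is simple arithmetic, and may be checked with a short sketch. The function name `element_count` is chosen for this example only.

```python
def element_count(vector_bytes, element_bits):
    """Number of data elements that fit in a vector operand."""
    total_bits = vector_bytes * 8
    if total_bits % element_bits:
        raise ValueError("element width must divide the vector length")
    return total_bits // element_bits


# Every combination of operand length and element width listed above.
table = {
    (vector_bytes, element_bits): element_count(vector_bytes, element_bits)
    for vector_bytes in (64, 32, 16)
    for element_bits in (64, 32, 16, 8)
}
```

For instance, `table[(64, 32)]` gives the 16 doubleword-size elements and `table[(64, 64)]` gives the 8 quadword-size elements mentioned for a 64 byte vector.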

The class A instruction templates in FIG. 8a include: 1) within the no memory access 805 instruction templates there is shown a no memory access, full round control type operation 810 instruction template and a no memory access, data transform type operation 815 instruction template; and 2) within the memory access 820 instruction templates there is shown a memory access, temporal 825 instruction template and a memory access, nontemporal 830 instruction template. The class B instruction templates in FIG. 8b include: 1) within the no memory access 805 instruction templates there is shown a no memory access, write mask control, partial round control type operation 812 instruction template and a no memory access, write mask control, VSIZE type operation 817 instruction template; and 2) within the memory access 820 instruction templates there is shown a memory access, write mask control 827 instruction template.

The generic vector-friendly instruction format 800 includes the following fields listed below in the order illustrated in FIGS. 8a-8b.

Format field 840—a specific value (an instruction format identifier value) in this field uniquely identifies the vector-friendly instruction format, and thus occurrences of instructions in the vector-friendly instruction format in instruction streams. As such, this field is optional in the sense that it is not needed for an instruction set that has only the generic vector-friendly instruction format.

Base operation field 842—its content distinguishes different base operations.

Register index field 844—its content, directly or through address generation, specifies the locations of the source and destination operands, be they in registers or in memory. These include a sufficient number of bits to select N registers from a P×Q (e.g. 32×512, 16×128, 32×1024, 64×1024) register file. While in one embodiment N may be up to three sources and one destination register, alternative embodiments may support more or fewer sources and destination registers (e.g., may support up to two sources where one of these sources also acts as the destination, may support up to three sources where one of these sources also acts as the destination, or may support up to two sources and one destination).

Modifier field 846—its content distinguishes occurrences of instructions in the generic vector instruction format that specify memory access from those that do not; that is, between no memory access 805 instruction templates and memory access 820 instruction templates. Memory access operations read and/or write to the memory hierarchy (in some cases specifying the source and/or destination addresses using values in registers), while non-memory access operations do not (e.g., the source and destinations are registers). While in one embodiment this field also selects between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations.

Augmentation operation field 850—its content distinguishes which one of a variety of different operations to be performed in addition to the base operation. This field is context specific. In one embodiment of the specification, this field is divided into a class field 868, an alpha field 852, and a beta field 854. The augmentation operation field 850 allows common groups of operations to be performed in a single instruction rather than 2, 3, or 4 instructions.

Scale field 860—its content allows for the scaling of the index field's content for memory address generation (e.g., for address generation that uses 2^(scale)*index+base).

Displacement Field 862A—its content is used as part of memory address generation (e.g., for address generation that uses 2^(scale)*index+base+displacement).

Displacement Factor Field 862B (note that the juxtaposition of displacement field 862A directly over displacement factor field 862B indicates one or the other is used)—its content is used as part of address generation; it specifies a displacement factor that is to be scaled by the size of a memory access (N)—where N is the number of bytes in the memory access (e.g., for address generation that uses 2^(scale)*index+base+scaled displacement). Redundant low-order bits are ignored and hence, the displacement factor field's content is multiplied by the memory operand's total size (N) in order to generate the final displacement to be used in calculating an effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 874 (described later herein) and the data manipulation field 854C. The displacement field 862A and the displacement factor field 862B are optional in the sense that they are not used for the no memory access 805 instruction templates and/or different embodiments may implement only one or none of the two.
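The address generation expressions given for the scale, displacement, and displacement factor fields may be illustrated with a short worked sketch. The function names and the numeric values are hypothetical, and restricting the scale to the values 0 through 3 is an assumption of this example rather than a requirement of the generic format.

```python
def effective_address(base, index, scale, displacement=0):
    """Compute 2**scale * index + base + displacement."""
    if scale not in range(4):  # assumed two-bit scale for this sketch
        raise ValueError("scale is assumed to be 0..3 here")
    return (index << scale) + base + displacement


def scaled_displacement(factor, n):
    """Displacement factor multiplied by the memory access size N in bytes."""
    return factor * n


# Plain displacement: 0x1000 + 4*3 + 0x40.
addr_plain = effective_address(base=0x1000, index=3, scale=2, displacement=0x40)

# Same address reached with a displacement factor of 1 and N = 64 bytes.
addr_scaled = effective_address(
    base=0x1000, index=3, scale=2, displacement=scaled_displacement(1, 64)
)
```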

Data element width field 864—its content distinguishes which one of a number of data element widths is to be used (in some embodiments, for all instructions; in other embodiments, for only some of the instructions). This field is optional in the sense that it is not needed if only one data element width is supported and/or data element widths are supported using some aspect of the opcodes.

Write mask field 870—its content controls, on a per data element position basis, whether that data element position in the destination vector operand reflects the result of the base operation and augmentation operation. Class A instruction templates support merging-write masking, while class B instruction templates support both merging and zeroing-write masking. When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation)—in one embodiment, preserving the old value of each element of the destination where the corresponding mask bit has a 0. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, it is not necessary that the elements that are modified be consecutive. Thus, the write mask field 870 allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While embodiments of the specification are described in which the write mask field's 870 content selects one of a number of write mask registers that contains the write mask to be used (and thus the write mask field's 870 content indirectly identifies that masking to be performed), alternative embodiments instead or additionally allow the write mask field's 870 content to directly specify the masking to be performed.
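The difference between merging and zeroing write masking may be illustrated element by element. The following is a behavioral sketch only, using Python lists in place of vector registers; the function name `masked_op`, the use of addition as the operation, and the convention that mask bit i governs element i are assumptions of the example.

```python
def masked_op(dest, a, b, mask, zeroing=False):
    """Apply a + b per element under a write mask.

    Where the mask bit is 1 the result is written. Where it is 0, merging
    keeps the old destination element and zeroing writes 0 instead.
    """
    result = []
    for i, (old, x, y) in enumerate(zip(dest, a, b)):
        if (mask >> i) & 1:
            result.append(x + y)
        else:
            result.append(0 if zeroing else old)
    return result


dest = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
mask = 0b0101  # elements 0 and 2 are enabled

merged = masked_op(dest, a, b, mask)                # [11, 9, 33, 9]
zeroed = masked_op(dest, a, b, mask, zeroing=True)  # [11, 0, 33, 0]
```

Note that the enabled elements need not be consecutive, consistent with the description above.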

Immediate field 872—its content allows for the specification of an immediate. This field is optional in the sense that it is not present in an implementation of the generic vector-friendly format that does not support immediate and it is not present in instructions that do not use an immediate.

Class field 868—its content distinguishes between different classes of instructions. With reference to FIGS. 8a-8b, the contents of this field select between class A and class B instructions. In FIGS. 8a-8b, rounded corner squares are used to indicate a specific value is present in a field (e.g., class A 868A and class B 868B for the class field 868 respectively in FIGS. 8a-8b).

Instruction Templates of Class A

In the case of the non-memory access 805 instruction templates of class A, the alpha field 852 is interpreted as an RS field 852A, whose content distinguishes which one of the different augmentation operation types is to be performed (e.g., round 852A.1 and data transform 852A.2 are respectively specified for the no memory access, round type operation 810 and the no memory access, data transform type operation 815 instruction templates), while the beta field 854 distinguishes which of the operations of the specified type is to be performed. In the no memory access 805 instruction templates, the scale field 860, the displacement field 862A, and the displacement scale field 862B are not present.

No-Memory Access Instruction Templates—Full Round Control Type Operation

In the no memory access full round control type operation 810 instruction template, the beta field 854 is interpreted as a round control field 854A, whose content provides static rounding. While in the described embodiments of the specification the round control field 854A includes a suppress all floating point exceptions (SAE) field 856 and a round operation control field 858, alternative embodiments may encode both these concepts into the same field or only have one or the other of these concepts/fields (e.g., may have only the round operation control field 858).

SAE field 856—its content distinguishes whether or not to disable the exception event reporting; when the SAE field's 856 content indicates suppression is enabled, a given instruction does not report any kind of floating-point exception flag and does not raise any floating point exception handler.

Round operation control field 858—its content distinguishes which one of a group of rounding operations to perform (e.g., round-up, round-down, round-towards-zero and round-to-nearest). Thus, the round operation control field 858 allows for the changing of the rounding mode on a per instruction basis. In one embodiment of the specification where a processor includes a control register for specifying rounding modes, the round operation control field's 858 content overrides that register value.
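The per-instruction override of a control register's rounding mode may be sketched as follows. This is a simplified illustration that rounds floating-point values to integers, which is not what a vector instruction would do; the function name `rounded` and the mode strings are assumptions of the example.

```python
import math

# Each named mode maps to a Python function that rounds to an integer.
# Python's round() sends exact ties to the nearest even integer.
ROUNDERS = {
    "round-up": math.ceil,
    "round-down": math.floor,
    "round-towards-zero": math.trunc,
    "round-to-nearest": round,
}


def rounded(value, control_register_mode, instruction_mode=None):
    """Round using the instruction's static mode when one is given,
    otherwise fall back to the mode held in the control register."""
    mode = control_register_mode if instruction_mode is None else instruction_mode
    return ROUNDERS[mode](value)


default_result = rounded(2.7, "round-to-nearest")                       # 3
override_result = rounded(2.7, "round-to-nearest", "round-towards-zero")  # 2
```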

No Memory Access Instruction Templates—Data Transform Type Operation

In the no memory access data transform type operation 815 instruction template, the beta field 854 is interpreted as a data transform field 854B, whose content distinguishes which one of a number of data transforms is to be performed (e.g., no data transform, swizzle, broadcast).

In the case of a memory access 820 instruction template of class A, the alpha field 852 is interpreted as an eviction hint field 852B, whose content distinguishes which one of the eviction hints is to be used (in FIG. 8a, temporal 852B.1 and nontemporal 852B.2 are respectively specified for the memory access, temporal 825 instruction template and the memory access, nontemporal 830 instruction template), while the beta field 854 is interpreted as a data manipulation field 854C, whose content distinguishes which one of a number of data manipulation operations (also known as primitives) is to be performed (e.g., no manipulation; broadcast; up conversion of a source; and down conversion of a destination). The memory access 820 instruction templates include the scale field 860, and optionally the displacement field 862A or the displacement scale field 862B.

Vector memory instructions perform vector loads from and vector stores to memory, with conversion support. As with regular vector instructions, vector memory instructions transfer data from/to memory in a data element-wise fashion, with the elements that are actually transferred as dictated by the contents of the vector mask that is selected as the write mask.

Memory Access Instruction Templates—Temporal

Temporal data is data likely to be reused soon enough to benefit from caching. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

Memory Access Instruction Templates—Nontemporal

Nontemporal data is data unlikely to be reused soon enough to benefit from caching in the 1st-level cache and should be given priority for eviction. This is, however, a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

Instruction Templates of Class B

In the case of the instruction templates of class B, the alpha field 852 is interpreted as a write mask control (Z) field 852C, whose content distinguishes whether the write masking controlled by the write mask field 870 should be a merging or a zeroing.

In the case of the non-memory access 805 instruction templates of class B, part of the beta field 854 is interpreted as an RL field 857A, whose content distinguishes which one of the different augmentation operation types is to be performed (e.g., round 857A.1 and vector length (VSIZE) 857A.2 are respectively specified for the no memory access, write mask control, partial round control type operation 812 instruction template and the no memory access, write mask control, VSIZE type operation 817 instruction template), while the rest of the beta field 854 distinguishes which of the operations of the specified type is to be performed. In the no memory access 805 instruction templates, the scale field 860, the displacement field 862A, and the displacement scale field 862B are not present.

In the no memory access, write mask control, partial round control type operation 812 instruction template, the rest of the beta field 854 is interpreted as a round operation control field 859A and exception event reporting is disabled (a given instruction does not report any kind of floating-point exception flag and does not raise any floating point exception handler).

Round operation control field 859A—just as round operation control field 858, its content distinguishes which one of a group of rounding operations to perform (e.g., round-up, round-down, round-towards-zero and round-to-nearest). Thus, the round operation control field 859A allows for the changing of the rounding mode on a per instruction basis. In one embodiment of the specification where a processor includes a control register for specifying rounding modes, the round operation control field's 859A content overrides that register value.

In the no memory access, write mask control, VSIZE type operation 817 instruction template, the rest of the beta field 854 is interpreted as a vector length field 859B, whose content distinguishes which one of a number of data vector lengths is to be performed on (e.g., 128, 256, or 512 byte).

In the case of a memory access 820 instruction template of class B, part of the beta field 854 is interpreted as a broadcast field 857B, whose content distinguishes whether or not the broadcast type data manipulation operation is to be performed, while the rest of the beta field 854 is interpreted as the vector length field 859B. The memory access 820 instruction templates include the scale field 860, and optionally the displacement field 862A or the displacement scale field 862B.

With regard to the generic vector-friendly instruction format 800, a full opcode field 874 is shown including the format field 840, the base operation field 842, and the data element width field 864. While one embodiment is shown where the full opcode field 874 includes all of these fields, the full opcode field 874 includes less than all of these fields in embodiments that do not support all of them. The full opcode field 874 provides the operation code (opcode).

The augmentation operation field 850, the data element width field 864, and the write mask field 870 allow these features to be specified on a per instruction basis in the generic vector-friendly instruction format.

The combination of write mask field and data element width field create typed instructions in that they allow the mask to be applied based on different data element widths.

The various instruction templates found within class A and class B are beneficial in different situations. In some embodiments of the specification, different processors or different cores within a processor may support only class A, only class B, or both classes. For instance, a high performance general purpose out-of-order core intended for general-purpose computing may support only class B, a core intended primarily for graphics and/or scientific (throughput) computing may support only class A, and a core intended for both may support both (of course, a core that has some mix of templates and instructions from both classes but not all templates and instructions from both classes is within the purview of the specification). Also, a single processor may include multiple cores, all of which support the same class or in which different cores support different classes. For instance, in a processor with separate graphics and general purpose cores, one of the graphics cores intended primarily for graphics and/or scientific computing may support only class A, while one or more of the general purpose cores may be high performance general purpose cores with out-of-order execution and register renaming intended for general-purpose computing that supports only class B. Another processor that does not have a separate graphics core may include one or more general purpose in-order or out-of-order cores that support both class A and class B. Of course, features from one class may also be implemented in the other class in different embodiments of the specification. Programs written in a high level language would be put (e.g., just in time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the class or classes supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all classes and having control flow code that selects the routines to execute based on the instructions supported by the processor which is currently executing the code.
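The second executable form, in which control flow code selects among alternative routines, may be sketched as follows. The routine names, the preference ordering, and the representation of supported classes as a set of letters are hypothetical; in this sketch all three routines compute the same sum and merely stand in for differently compiled variants.

```python
def select_routine(supported_classes, routines):
    """Return the first routine whose required classes are all supported."""
    for required_classes, routine in routines:
        if required_classes <= supported_classes:
            return routine
    raise LookupError("no routine matches the supported instruction classes")


def sum_class_b(values):  # stand-in for a variant built from class B instructions
    return sum(values)


def sum_class_a(values):  # stand-in for a variant built from class A instructions
    return sum(values)


def sum_scalar(values):   # fallback that requires neither class
    total = 0
    for value in values:
        total += value
    return total


# Ordered by preference; the empty requirement always matches.
ROUTINES = [
    (frozenset({"B"}), sum_class_b),
    (frozenset({"A"}), sum_class_a),
    (frozenset(), sum_scalar),
]

chosen = select_routine({"A"}, ROUTINES)
fallback = select_routine(set(), ROUTINES)
```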

Example Specific Vector-Friendly Instruction Format

FIGS. 9a-9d are block diagrams illustrating an example specific vector-friendly instruction format according to embodiments of the specification. FIGS. 9a-9d show a specific vector-friendly instruction format 900 that is specific in the sense that it specifies the location, size, interpretation, and order of the fields, as well as values for some of those fields. The specific vector-friendly instruction format 900 may be used to extend the x86 instruction set, and thus some of the fields are similar or the same as those used in the existing x86 instruction set and extension thereof (e.g., AVX). This format remains consistent with the prefix encoding field, real opcode byte field, MOD R/M field, SIB field, displacement field, and immediate fields of the existing x86 instruction set with extensions. The fields from FIGS. 8a and 8b into which the fields from FIGS. 9a-9d map are illustrated.

It should be understood that, although embodiments of the specification are described with reference to the specific vector-friendly instruction format 900 in the context of the generic vector-friendly instruction format 800 for illustrative purposes, the present specification is not limited to the specific vector-friendly instruction format 900 except where claimed. For example, the generic vector-friendly instruction format 800 contemplates a variety of possible sizes for the various fields, while the specific vector-friendly instruction format 900 is shown as having fields of specific sizes. By way of particular example, while the data element width field 864 is illustrated as a one bit field in the specific vector-friendly instruction format 900, the present specification is not so limited (that is, the generic vector-friendly instruction format 800 contemplates other sizes of the data element width field 864).

The generic vector-friendly instruction format 800 includes the following fields listed below in the order illustrated in FIG. 9a.

EVEX Prefix (Bytes 0-3) 902—is encoded in a four-byte form.

Format Field 840 (EVEX Byte 0, bits [7:0])—the first byte (EVEX Byte 0) is the format field 840 and it contains 0x62 (the unique value used for distinguishing the vector-friendly instruction format in one embodiment).

The second through fourth bytes (EVEX Bytes 1-3) include a number of bit fields providing specific capability.

REX field 905 (EVEX Byte 1, bits [7-5])—consists of an EVEX.R bit field (EVEX Byte 1, bit [7]-R), EVEX.X bit field (EVEX byte 1, bit [6]-X), and EVEX.B bit field (EVEX byte 1, bit [5]-B). The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields, and are encoded using 1s complement form, i.e. ZMM0 is encoded as 1111B, ZMM15 is encoded as 0000B. Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding EVEX.R, EVEX.X, and EVEX.B.
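The 1s complement register encoding and the formation of a four-bit index from an extension bit plus three low-order bits may be illustrated with a small sketch. The function names are hypothetical, and the sketch models only the bit arithmetic stated above, not a full instruction encoder.

```python
def invert4(value):
    """1s complement of a four-bit value (applying it twice restores the input)."""
    if not 0 <= value <= 0xF:
        raise ValueError("value must fit in four bits")
    return value ^ 0xF


def combine(extension_bit, low_three_bits):
    """Form a four-bit index such as Rrrr from one extension bit and rrr."""
    return (extension_bit << 3) | low_three_bits


zmm0_encoded = invert4(0)    # 0b1111
zmm15_encoded = invert4(15)  # 0b0000
```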

REX′ field 810—this is the first part of the REX′ field 810 and is the EVEX.R′ bit field (EVEX Byte 1, bit [4]-R′) that is used to encode either the upper 16 or lower 16 of the extended 32 register set. In one embodiment, this bit, along with others as indicated below, is stored in bit inverted format to distinguish (in the well-known x86 32-bit mode) from the BOUND instruction, whose real opcode byte is 62, but does not accept in the MOD R/M field (described below) the value of 11 in the MOD field; other embodiments do not store this and the other indicated bits below in the inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R′Rrrr is formed by combining EVEX.R′, EVEX.R, and the other RRR from other fields.

Opcode map field 915 (EVEX byte 1, bits [3:0]-mmmm)—its content encodes an implied leading opcode byte (0F, 0F 38, or 0F 3).

Data element width field 864 (EVEX byte 2, bit [7]-W)—is represented by the notation EVEX.W. EVEX.W is used to define the granularity (size) of the datatype (either 32-bit data elements or 64-bit data elements).

EVEX.vvvv 920 (EVEX Byte 2, bits [6:3]-vvvv)—the role of EVEX.vvvv may include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1s complement) form and is valid for instructions with 2 or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1s complement form for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, the field is reserved and should contain 1111b. Thus, EVEX.vvvv field 920 encodes the 4 low-order bits of the first source register specifier stored in inverted (1s complement) form. Depending on the instruction, an extra different EVEX bit field is used to extend the specifier size to 32 registers.

EVEX.U 868 Class field (EVEX byte 2, bit [2]-U)—if EVEX.U=0, it indicates class A or EVEX.U0; if EVEX.U=1, it indicates class B or EVEX.U1.

Prefix encoding field 925 (EVEX byte 2, bits [1:0]-pp)—provides additional bits for the base operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the benefit of compacting the SIMD prefix (rather than requiring a byte to express the SIMD prefix, the EVEX prefix requires only 2 bits). In one embodiment, to support legacy SSE instructions that use an SIMD prefix (66H, F2H, F3H) in both the legacy format and in the EVEX prefix format, these legacy SIMD prefixes are encoded into the SIMD prefix encoding field; and at runtime are expanded into the legacy SIMD prefix prior to being provided to the decoder's PLA (so the PLA can execute both the legacy and EVEX format of these legacy instructions without modification). Although newer instructions could use the EVEX prefix encoding field's content directly as an opcode extension, certain embodiments expand in a similar fashion for consistency but allow for different meanings to be specified by these legacy SIMD prefixes. An alternative embodiment may redesign the PLA to support the 2 bit SIMD prefix encodings, and thus not require the expansion.

Alpha field 852 (EVEX byte 3, bit [7]-EH; also known as EVEX.eh,EVEX.rs, EVEX.rl, EVEX.write mask control, and EVEX.n; also illustratedwith a)—as previously described, this field is context specific.

Beta field 854 (EVEX byte 3, bits [6:4]-SSS, also known as EVEX.s₂₋₀,EVEX.r₂₋₀, EVEX.rr1, EVEX.LL0, EVEX.LLB; also illustrated with βββ)—aspreviously described, this field is context specific.

REX′ field 810—this is the remainder of the REX′ field and is theEVEX.V′ bit field (EVEX Byte 3, bit [3]-V′) that may be used to encodeeither the upper 16 or lower 16 of the extended 32 register set. Thisbit is stored in bit inverted format. A value of 1 is used to encode thelower 16 registers. In other words, V′VVVV is formed by combiningEVEX.V′, EVEX.vvvv.

Write mask field 870 (EVEX byte 3, bits [2:0]-kkk)—its content specifies the index of a register in the write mask registers as previously described. In one embodiment, the specific value EVEX.kkk=000 has a special behavior implying no write mask is used for the particular instruction (this may be implemented in a variety of ways including the use of a write mask hardwired to all ones or hardware that bypasses the masking hardware).

Real Opcode Field 930 (Byte 4) is also known as the opcode byte. Part of the opcode is specified in this field.

MOD R/M Field 940 (Byte 5) includes MOD field 942, Reg field 944, and R/M field 946. As previously described, the MOD field's 942 content distinguishes between memory access and non-memory access operations. The role of Reg field 944 can be summarized to two situations: encoding either the destination register operand or a source register operand, or being treated as an opcode extension and not used to encode any instruction operand. The role of R/M field 946 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.

Scale, Index, Base (SIB) Byte (Byte 6)—as previously described, the scale field's 850 content is used for memory address generation. SIB.xxx 954 and SIB.bbb 956—the contents of these fields have been previously referred to with regard to the register indexes Xxxx and Bbbb.

Displacement field 862A (Bytes 7-10)—when MOD field 942 contains 10, bytes 7-10 are the displacement field 862A, and it works the same as the legacy 32-bit displacement (disp32) and works at byte granularity.

Displacement factor field 862B (Byte 7)—when MOD field 942 contains 01, byte 7 is the displacement factor field 862B. The location of this field is the same as that of the legacy x86 instruction set 8-bit displacement (disp8), which works at byte granularity. Since disp8 is sign extended, it can only address between −128 and 127-byte offsets; in terms of 64 byte cache lines, disp8 uses 8 bits that can be set to only four really useful values −128, −64, 0, and 64; since a greater range is often needed, disp32 is used; however, disp32 requires 4 bytes. In contrast to disp8 and disp32, the displacement factor field 862B is a reinterpretation of disp8; when using displacement factor field 862B, the actual displacement is determined by the content of the displacement factor field multiplied by the size of the memory operand access (N). This type of displacement is referred to as disp8*N. This reduces the average instruction length (a single byte used for the displacement but with a much greater range). Such compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence, the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 862B substitutes the legacy x86 instruction set 8-bit displacement. Thus, the displacement factor field 862B is encoded the same way as an x86 instruction set 8-bit displacement (so no changes in the ModRM/SIB encoding rules) with the only exception that disp8 is overloaded to disp8*N. In other words, there are no changes in the encoding rules or encoding lengths but only in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size of the memory operand to obtain a byte-wise address offset).

Immediate field 872 operates as previously described.
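By way of nonlimiting illustration, the disp8*N interpretation described above for the displacement factor field 862B may be sketched in Python as follows. The function names are hypothetical and chosen only for this sketch; the only behavior modeled is the sign extension of the stored byte and its scaling by the memory operand size N.

```python
def sign_extend8(byte: int) -> int:
    """Interpret an 8-bit value (0..255) as a signed integer (-128..127)."""
    return byte - 256 if byte & 0x80 else byte


def effective_displacement(disp_byte: int, n: int) -> int:
    """Scale the stored one-byte factor by the memory operand size N (bytes)."""
    return sign_extend8(disp_byte) * n


def encode_disp8n(displacement: int, n: int):
    """Return the one-byte factor for a byte displacement, or None.

    None means the displacement is not a multiple of N or the factor does
    not fit in a signed byte, so a wider displacement would be needed.
    """
    if displacement % n != 0:
        return None
    factor = displacement // n
    if not -128 <= factor <= 127:
        return None
    return factor & 0xFF
```

With a 64-byte memory operand, a single displacement byte in this sketch reaches byte offsets from −8192 to 8128 in steps of 64, whereas a displacement that is not a multiple of N cannot be expressed in the compressed form.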

Full Opcode Field

FIG. 9b is a block diagram illustrating the fields of the specific vector-friendly instruction format 900 that make up the full opcode field 874 according to one embodiment. Specifically, the full opcode field 874 includes the format field 840, the base operation field 842, and the data element width (W) field 864. The base operation field 842 includes the prefix encoding field 925, the opcode map field 915, and the real opcode field 930.

Register Index Field

FIG. 9c is a block diagram illustrating the fields of the specific vector-friendly instruction format 900 that make up the register index field 844 according to one embodiment. Specifically, the register index field 844 includes the REX field 905, the REX′ field 910, the MODR/M.reg field 944, the MODR/M.r/m field 946, the VVVV field 920, xxx field 954, and the bbb field 956.

Augmentation Operation Field

FIG. 9d is a block diagram illustrating the fields of the specific vector-friendly instruction format 900 that make up the augmentation operation field 850 according to one embodiment. When the class (U) field 868 contains 0, it signifies EVEX.U0 (class A 868A); when it contains 1, it signifies EVEX.U1 (class B 868B). When U=0 and the MOD field 942 contains 11 (signifying a no memory access operation), the alpha field 852 (EVEX byte 3, bit [7]-EH) is interpreted as the rs field 852A. When the rs field 852A contains a 1 (round 852A.1), the beta field 854 (EVEX byte 3, bits [6:4]-SSS) is interpreted as the round control field 854A. The round control field 854A includes a one bit SAE field 856 and a two bit round operation field 858. When the rs field 852A contains a 0 (data transform 852A.2), the beta field 854 (EVEX byte 3, bits [6:4]-SSS) is interpreted as a three bit data transform field 854B. When U=0 and the MOD field 942 contains 00, 01, or 10 (signifying a memory access operation), the alpha field 852 (EVEX byte 3, bit [7]-EH) is interpreted as the eviction hint (EH) field 852B and the beta field 854 (EVEX byte 3, bits [6:4]-SSS) is interpreted as a three bit data manipulation field 854C.

When U=1, the alpha field 852 (EVEX byte 3, bit [7]-EH) is interpreted as the write mask control (Z) field 852C. When U=1 and the MOD field 942 contains 11 (signifying a no memory access operation), part of the beta field 854 (EVEX byte 3, bit [4]-S₀) is interpreted as the RL field 857A; when it contains a 1 (round 857A.1) the rest of the beta field 854 (EVEX byte 3, bit [6-5]-S₂₋₁) is interpreted as the round operation field 859A, while when the RL field 857A contains a 0 (VSIZE 857A.2) the rest of the beta field 854 (EVEX byte 3, bit [6-5]-S₂₋₁) is interpreted as the vector length field 859B (EVEX byte 3, bit [6-5]-L₁₋₀). When U=1 and the MOD field 942 contains 00, 01, or 10 (signifying a memory access operation), the beta field 854 (EVEX byte 3, bits [6:4]-SSS) is interpreted as the vector length field 859B (EVEX byte 3, bit [6-5]-L₁₋₀) and the broadcast field 857B (EVEX byte 3, bit [4]-B).
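The context-specific interpretation of the alpha field 852 and beta field 854 described above may be summarized, purely as an illustrative sketch, by the following Python function. The function name and the dictionary keys are hypothetical; the bit positions and the selection by the class (U) field 868 and the MOD field 942 follow the description, and the round control field 854A is returned undivided because the position of the SAE bit within it is not stated here.

```python
def interpret_alpha_beta(byte3: int, u: int, mod: int) -> dict:
    """Interpret EVEX byte 3 bits [7] (alpha) and [6:4] (beta) by U and MOD."""
    alpha = (byte3 >> 7) & 0x1
    beta = (byte3 >> 4) & 0x7
    no_memory_access = (mod == 0b11)

    if u == 0:  # class A
        if no_memory_access:
            if alpha == 1:  # rs field selects round
                return {"rs": 1, "round_control": beta}
            return {"rs": 0, "data_transform": beta}
        return {"eviction_hint": alpha, "data_manipulation": beta}

    # class B: alpha is always the write mask control (Z) field
    fields = {"write_mask_control_z": alpha}
    upper = (beta >> 1) & 0x3  # byte 3 bits [6:5]
    low = beta & 0x1           # byte 3 bit [4]
    if no_memory_access:
        fields["rl"] = low
        if low == 1:
            fields["round_operation"] = upper
        else:
            fields["vector_length"] = upper
    else:
        fields["vector_length"] = upper
        fields["broadcast"] = low
    return fields
```

For example, under these assumptions the byte value 0xA0 with U=1 and MOD=11 yields a set Z bit, an RL bit of 0, and a vector length field value of 1.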

Example Register Architecture

FIG. 10 is a block diagram of a register architecture 1000 according to one embodiment. In the embodiment illustrated, there are 32 vector registers 1010 that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower order 128 bits of the lower 16 zmm registers (the lower order 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector-friendly instruction format 900 operates on these overlaid register files as illustrated in the table below.

Adjustable Vector Length | Class | Operations | Registers
Instruction templates that do not include the vector length field 859B | A (FIG. 8a; U = 0) | 810, 815, 825, 830 | zmm registers (the vector length is 64 byte)
Instruction templates that do not include the vector length field 859B | B (FIG. 8b; U = 1) | 812 | zmm registers (the vector length is 64 byte)
Instruction templates that do include the vector length field 859B | B (FIG. 8b; U = 1) | 817, 827 | zmm, ymm, or xmm registers (the vector length is 64 byte, 32 byte, or 16 byte) depending on the vector length field 859B

In other words, the vector length field 859B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length; and instruction templates without the vector length field 859B operate on the maximum vector length. Further, in one embodiment, the class B instruction templates of the specific vector-friendly instruction format 900 operate on packed or scalar single/double-precision floating point data and packed or scalar integer data. Scalar operations are operations performed on the lowest order data element position in a zmm/ymm/xmm register; the higher order data element positions are either left the same as they were prior to the instruction or zeroed depending on the embodiment.
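As a minimal sketch of the register overlay and of the halving of vector lengths described above, the following Python model treats each zmm register as a 512-bit integer and exposes the ymm and xmm views of the lower 16 registers as its low-order 256 and 128 bits. The class and function names are hypothetical and are used only for illustration.

```python
ZMM_BITS, YMM_BITS, XMM_BITS = 512, 256, 128


class VectorRegisterFile:
    """Illustrative model: 32 zmm registers; ymm/xmm alias the lower 16."""

    def __init__(self):
        self._zmm = [0] * 32

    def write_zmm(self, index: int, value: int) -> None:
        self._zmm[index] = value & ((1 << ZMM_BITS) - 1)

    def read_zmm(self, index: int) -> int:
        return self._zmm[index]

    def _overlay(self, index: int, bits: int) -> int:
        # Only the lower 16 zmm registers carry ymm/xmm overlays.
        if not 0 <= index < 16:
            raise IndexError("overlay exists only for registers 0-15")
        return self._zmm[index] & ((1 << bits) - 1)

    def read_ymm(self, index: int) -> int:
        return self._overlay(index, YMM_BITS)

    def read_xmm(self, index: int) -> int:
        return self._overlay(index, XMM_BITS)


def selectable_vector_lengths(max_bytes: int = 64, count: int = 3) -> list:
    """Each shorter length is half the preceding one, starting at the maximum."""
    return [max_bytes >> i for i in range(count)]
```

In this model a value written to zmm3 is visible, truncated, through ymm3 and xmm3, and the selectable lengths for a 64-byte maximum are 64, 32, and 16 bytes, matching the zmm, ymm, and xmm rows of the table.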

Write mask registers 1015—in the embodiment illustrated, there are 8 write mask registers (k0 through k7), each 64 bits in size. In an alternate embodiment, the write mask registers 1015 are 16 bits in size. As previously described, in one embodiment, the vector mask register k0 cannot be used as a write mask; when the encoding that would normally indicate k0 is used for a write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.
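The special treatment of the k0 encoding may be illustrated with the following Python sketch. It assumes the alternate embodiment with 16-bit write masks and, for simplicity, a merging behavior in which masked-off destination elements keep their prior values; both the function names and the merging choice are assumptions made only for this example.

```python
HARDWIRED_MASK = 0xFFFF  # selected when the encoding would indicate k0


def select_write_mask(kkk: int, mask_registers: list) -> int:
    """Return the effective 16-bit write mask for a 3-bit kkk encoding."""
    if kkk == 0:
        return HARDWIRED_MASK  # write masking effectively disabled
    return mask_registers[kkk] & 0xFFFF


def masked_write(dest: list, result: list, mask: int) -> list:
    """Merge: element i takes the new result only if mask bit i is set."""
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]
```

With kkk=000 every element of the result is written, while a mask value of 0b0101 in k1 updates only element positions 0 and 2.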

General-purpose registers 1025—in the embodiment illustrated, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

Scalar floating point stack register file (x87 stack) 1045, on which is aliased the MMX packed integer flat register file 1050—in the embodiment illustrated, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating point data using the x87 instruction set extension; while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.

Other embodiments may use wider or narrower registers. Additionally, other embodiments may use more, fewer, or different register files and registers.

Example Core Architectures, Processors, and Computer Architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific throughput. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Example core architectures are described next, followed by descriptions of example processors and computer architectures.

Example Core Architectures

In-Order and Out-of-Order Core Block Diagram

FIG. 11a is a block diagram illustrating both an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline. FIG. 11b is a block diagram illustrating both an embodiment of an in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor. The solid lined boxes in FIGS. 11a-11b illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In FIG. 11a, a processor pipeline 1100 includes a fetch stage 1102, a length decode stage 1104, a decode stage 1106, an allocation stage 1108, a renaming stage 1110, a scheduling (also known as a dispatch or issue) stage 1112, a register read/memory read stage 1114, an execute stage 1116, a write back/memory write stage 1118, an exception handling stage 1122, and a commit stage 1124.

FIG. 11b shows processor core 1190 including a front end unit 1130 coupled to an execution engine unit 1150, and both are coupled to a memory unit 1170. The core 1190 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1190 may be a special-purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.

The front end unit 1130 includes a branch prediction unit 1132 coupled to an instruction cache unit 1134, which is coupled to an instruction translation lookaside buffer (TLB) 1136, which is coupled to an instruction fetch unit 1138, which is coupled to a decode unit 1140. The decode unit 1140 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1140 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1190 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1140 or otherwise within the front end unit 1130). The decode unit 1140 is coupled to a rename/allocator unit 1152 in the execution engine unit 1150.

The execution engine unit 1150 includes the rename/allocator unit 1152 coupled to a retirement unit 1154 and a set of one or more scheduler unit(s) 1156. The scheduler unit(s) 1156 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1156 is coupled to the physical register file(s) unit(s) 1158. Each of the physical register file(s) units 1158 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1158 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 1158 is overlapped by the retirement unit 1154 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1154 and the physical register file(s) unit(s) 1158 are coupled to the execution cluster(s) 1160. The execution cluster(s) 1160 includes a set of one or more execution units 1162 and a set of one or more memory access units 1164. The execution units 1162 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1156, physical register file(s) unit(s) 1158, and execution cluster(s) 1160 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1164). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 1164 is coupled to the memory unit 1170, which includes a data TLB unit 1172 coupled to a data cache unit 1174 coupled to a level 2 (L2) cache unit 1176. In one embodiment, the memory access units 1164 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1172 in the memory unit 1170. The instruction cache unit 1134 is further coupled to a level 2 (L2) cache unit 1176 in the memory unit 1170. The L2 cache unit 1176 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the register renaming, out-of-order issue/execution core architecture may implement the pipeline 1100 as follows: 1) the instruction fetch unit 1138 performs the fetch and length decoding stages 1102 and 1104; 2) the decode unit 1140 performs the decode stage 1106; 3) the rename/allocator unit 1152 performs the allocation stage 1108 and renaming stage 1110; 4) the scheduler unit(s) 1156 performs the schedule stage 1112; 5) the physical register file(s) unit(s) 1158 and the memory unit 1170 perform the register read/memory read stage 1114; 6) the execution cluster 1160 performs the execute stage 1116; 7) the memory unit 1170 and the physical register file(s) unit(s) 1158 perform the write back/memory write stage 1118; 8) various units may be involved in the exception handling stage 1122; and 9) the retirement unit 1154 and the physical register file(s) unit(s) 1158 perform the commit stage 1124.
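The correspondence between the stages of the pipeline 1100 and the units of the core 1190 may be tabulated, as an illustrative sketch only, in the following Python structure. The reference numerals and unit names are taken from the description above; the table layout and the helper function name are assumptions of this example.

```python
# (stage reference numeral, stage name, units that perform the stage)
PIPELINE_1100 = [
    (1102, "fetch", ("instruction fetch unit 1138",)),
    (1104, "length decode", ("instruction fetch unit 1138",)),
    (1106, "decode", ("decode unit 1140",)),
    (1108, "allocation", ("rename/allocator unit 1152",)),
    (1110, "renaming", ("rename/allocator unit 1152",)),
    (1112, "schedule", ("scheduler unit(s) 1156",)),
    (1114, "register read/memory read",
     ("physical register file(s) unit(s) 1158", "memory unit 1170")),
    (1116, "execute", ("execution cluster 1160",)),
    (1118, "write back/memory write",
     ("memory unit 1170", "physical register file(s) unit(s) 1158")),
    # Various units may be involved here, so none is singled out.
    (1122, "exception handling", ()),
    (1124, "commit",
     ("retirement unit 1154", "physical register file(s) unit(s) 1158")),
]


def stages_for_unit(unit: str) -> list:
    """List the names of the pipeline stages in which a unit participates."""
    return [name for _, name, units in PIPELINE_1100 if unit in units]
```

For instance, looking up the memory unit 1170 in this table returns the register read/memory read stage and the write back/memory write stage.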

The core 1190 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1190 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1134/1174 and a shared L2 cache unit 1176, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

Example In-Order Core Architecture

FIGS. 12a-12b illustrate a block diagram of a more specific example in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory IO interfaces, and other necessary IO logic, depending on the application.

FIG. 12a is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1202 and with its local subset of the Level 2 (L2) cache 1204, according to one or more embodiments. In one embodiment, an instruction decoder 1200 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1206 allows low-latency accesses to cache memory into the scalar and vector units. While in one embodiment (to simplify the design), a scalar unit 1208 and a vector unit 1210 use separate register sets (respectively, scalar registers 1212 and vector registers 1214) and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 1206, other embodiments may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 1204 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1204. Data read by a processor core is stored in its L2 cache subset 1204 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1204 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data-path is 1012-bits wide per direction.

FIG. 12b is an expanded view of part of the processor core in FIG. 12a according to embodiments of the specification. FIG. 12b includes an L1 data cache 1206A, part of the L1 cache 1206, as well as more detail regarding the vector unit 1210 and the vector registers 1214. Specifically, the vector unit 1210 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1228), which executes one or more of integer, single-precision float, and double-precision float instructions. The VPU supports swizzling the register inputs with swizzle unit 1220, numeric conversion with numeric convert units 1222A-B, and replication with replication unit 1224 on the memory input. Write mask registers 1226 allow predicating resulting vector writes.

FIG. 13 is a block diagram of a processor 1300 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the specification. The solid lined boxes in FIG. 13 illustrate a processor 1300 with a single core 1302A, a system agent 1310, a set of one or more bus controller units 1316, while the optional addition of the dashed lined boxes illustrates an alternative processor 1300 with multiple cores 1302A-N, a set of one or more integrated memory controller unit(s) 1314 in the system agent unit 1310, and special purpose logic 1308.

Thus, different implementations of the processor 1300 may include: 1) a CPU with the special purpose logic 1308 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1302A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1302A-N being a large number of special purpose cores intended primarily for graphics and/or scientific throughput; and 3) a coprocessor with the cores 1302A-N being a large number of general purpose in-order cores. Thus, the processor 1300 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1300 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1306, and external memory (not shown) coupled to the set of integrated memory controller units 1314. The set of shared cache units 1306 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 1312 interconnects the integrated graphics logic 1308, the set of shared cache units 1306, and the system agent unit 1310/integrated memory controller unit(s) 1314, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1306 and cores 1302A-N.

In some embodiments, one or more of the cores 1302A-N are capable of multi-threading. The system agent 1310 includes those components coordinating and operating cores 1302A-N. The system agent unit 1310 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1302A-N and the integrated graphics logic 1308. The display unit is for driving one or more externally connected displays.

The cores 1302A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1302A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Example Computer Architectures

FIGS. 14-17 are block diagrams of example computer architectures. Other system designs and configurations known in the arts for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, hand held devices, and various other electronic devices, are also suitable. In general, a huge variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

Referring now to FIG. 14, shown is a block diagram of a system 1400 in accordance with one embodiment. The system 1400 may include one or more processors 1410, 1415, which are coupled to a controller hub 1420. In one embodiment the controller hub 1420 includes a graphics memory controller hub (GMCH) 1490 and an Input/Output Hub (IOH) 1450 (which may be on separate chips); the GMCH 1490 includes memory and graphics controllers to which are coupled memory 1440 and a coprocessor 1445; the IOH 1450 couples input/output (IO) devices 1460 to the GMCH 1490. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1440 and the coprocessor 1445 are coupled directly to the processor 1410, and the controller hub 1420 is in a single chip with the IOH 1450.

The optional nature of additional processors 1415 is denoted in FIG. 14 with broken lines. Each processor 1410, 1415 may include one or more of the processing cores described herein and may be some version of the processor 1300.

The memory 1440 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1420 communicates with the processor(s) 1410, 1415 via a multidrop bus, such as a frontside bus (FSB), point-to-point interface such as Ultra Path Interconnect (UPI), or similar connection 1495.

In one embodiment, the coprocessor 1445 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1420 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1410, 1415 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1410 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1410 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1445. Accordingly, the processor 1410 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1445. Coprocessor(s) 1445 accepts and executes the received coprocessor instructions.

Referring now to FIG. 15, shown is a block diagram of a first more specific example system 1500. As shown in FIG. 15, multiprocessor system 1500 is a point-to-point interconnect system, and includes a first processor 1570 and a second processor 1580 coupled via a point-to-point interconnect 1550. Each of processors 1570 and 1580 may be some version of the processor 1300. In one embodiment, processors 1570 and 1580 are respectively processors 1410 and 1415, while coprocessor 1538 is coprocessor 1445. In another embodiment, processors 1570 and 1580 are respectively processor 1410 and coprocessor 1445.

Processors 1570 and 1580 are shown including integrated memory controller (IMC) units 1572 and 1582, respectively. Processor 1570 also includes as part of its bus controller units point-to-point (P-P) interfaces 1576 and 1578; similarly, second processor 1580 includes P-P interfaces 1586 and 1588. Processors 1570, 1580 may exchange information via a point-to-point (P-P) interface 1550 using P-P interface circuits 1578, 1588. As shown in FIG. 15, IMCs 1572 and 1582 couple the processors to respective memories, namely a memory 1532 and a memory 1534, which may be portions of main memory locally attached to the respective processors.

Processors 1570, 1580 may each exchange information with a chipset 1590 via individual P-P interfaces 1552, 1554 using point to point interface circuits 1576, 1594, 1586, 1598. Chipset 1590 may optionally exchange information with the coprocessor 1538 via a high-performance interface 1539. In one embodiment, the coprocessor 1538 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 1590 may be coupled to a first bus 1516 via an interface 1596.In one embodiment, first bus 1516 may be a peripheral componentinterconnect (PCI) bus, or a bus such as a PCI Express bus or anotherthird generation IO interconnect bus, by way of nonlimiting example.

As shown in FIG. 15, various IO devices 1514 may be coupled to first bus 1516, along with a bus bridge 1518 which couples first bus 1516 to a second bus 1520. In one embodiment, one or more additional processor(s) 1515, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1516. In one embodiment, second bus 1520 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 1520 including, for example, a keyboard and/or mouse 1522, communication devices 1527 and a storage unit 1528 such as a disk drive or other mass storage device which may include instructions or code and data 1530, in one embodiment. Further, an audio IO 1524 may be coupled to the second bus 1520. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 15, a system may implement a multidrop bus or other such architecture.

Referring now to FIG. 16, shown is a block diagram of a second more specific example system 1600. FIGS. 15 and 16 bear like reference numerals, and certain aspects of FIG. 15 have been omitted from FIG. 16 in order to avoid obscuring other aspects of FIG. 16.

FIG. 16 illustrates that the processors 1570, 1580 may include integrated memory and IO control logic (“CL”) 1572 and 1582, respectively. Thus, the CL 1572, 1582 include integrated memory controller units and include IO control logic. FIG. 16 illustrates that not only are the memories 1532, 1534 coupled to the CL 1572, 1582, but also that IO devices 1614 are also coupled to the control logic 1572, 1582. Legacy IO devices 1615 are coupled to the chipset 1590.

Referring now to FIG. 17, shown is a block diagram of a SoC 1700 in accordance with an embodiment. Similar elements in FIG. 13 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 17, an interconnect unit(s) 1702 is coupled to: an application processor 1710 which includes a set of one or more cores 1302A-N and shared cache unit(s) 1306; a system agent unit 1310; a bus controller unit(s) 1316; an integrated memory controller unit(s) 1314; a set of one or more coprocessors 1720 which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1730; a direct memory access (DMA) unit 1732; and a display unit 1740 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1720 includes a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, a high-throughput MIC processor, embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Some embodiments may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and nonvolatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as code 1530 illustrated in FIG. 15, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, nontransitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

Accordingly, some embodiments also include nontransitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (Including Binary Translation, Code Morphing, Etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.

FIG. 18 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 18 shows a program in a high level language 1802 may be compiled using an x86 compiler 1804 to generate x86 binary code 1806 that may be natively executed by a processor with at least one x86 instruction set core 1816. The processor with at least one x86 instruction set core 1816 represents any processor that can perform substantially the same functions as an Intel® processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel® x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel® processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel® processor with at least one x86 instruction set core. The x86 compiler 1804 represents a compiler that is operable to generate x86 binary code 1806 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1816. Similarly, FIG. 18 shows the program in the high level language 1802 may be compiled using an alternative instruction set compiler 1808 to generate alternative instruction set binary code 1810 that may be natively executed by a processor without at least one x86 instruction set core 1814 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif. and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, Calif.). The instruction converter 1812 is used to convert the x86 binary code 1806 into code that may be natively executed by the processor without an x86 instruction set core 1814. This converted code is not likely to be the same as the alternative instruction set binary code 1810 because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1812 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1806.
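By way of nonlimiting illustration only, the conversion of instructions from a source instruction set into one or more instructions of a target instruction set may be sketched as follows. Both instruction sets here are made up for the example and do not correspond to x86, MIPS, ARM, or any real architecture; the names `convert` and `run_target` are hypothetical and are not taken from this disclosure.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, ...]


def convert(source: List[Instr]) -> List[Instr]:
    """Statically translate a hypothetical two-operand source ISA
    into a hypothetical three-operand target ISA."""
    target: List[Instr] = []
    for op, *args in source:
        if op == "INC":        # INC r        -> ADDI r, r, 1
            (reg,) = args
            target.append(("ADDI", reg, reg, "1"))
        elif op == "MOV":      # MOV dst, src -> ADDI dst, src, 0
            dst, src = args
            target.append(("ADDI", dst, src, "0"))
        elif op == "ADD":      # ADD dst, src -> ADD dst, dst, src
            dst, src = args
            target.append(("ADD", dst, dst, src))
        else:
            raise ValueError(f"no translation for {op!r}")
    return target


def run_target(program: List[Instr], regs: Dict[str, int]) -> Dict[str, int]:
    """Execute target-ISA instructions on a copy of the register file."""
    regs = dict(regs)
    for op, dst, a, b in program:
        if op == "ADDI":       # register plus immediate
            regs[dst] = regs.get(a, 0) + int(b)
        elif op == "ADD":      # register plus register
            regs[dst] = regs.get(a, 0) + regs.get(b, 0)
        else:
            raise ValueError(f"unknown target instruction {op!r}")
    return regs
```

In this sketch the converted program is built only from target instructions and accomplishes the same general operation as the source program, although it need not match what a native compiler for the target would have emitted.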

The foregoing outlines features of several embodiments so that those skilled in the art may better understand various aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.

All or part of any hardware element disclosed herein may readily be provided in a system-on-a-chip (SoC), including central processing unit (CPU) package. An SoC represents an integrated circuit (IC) that integrates components of a computer or other electronic system into a single chip. The SoC may contain digital, analog, mixed-signal, and radio frequency functions, all of which may be provided on a single chip substrate. Other embodiments may include a multichip module (MCM), with a plurality of chips located within a single electronic package and configured to interact closely with each other through the electronic package. In various other embodiments, the computing functionalities disclosed herein may be implemented in one or more silicon cores in application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and other semiconductor chips.

As used throughout this specification, the term “processor” or “microprocessor” should be understood to include not only a traditional microprocessor (such as industry-leading x86 and x64 architectures by Intel®), but also any ASIC, FPGA, microcontroller, digital signal processor (DSP), programmable logic device, programmable logic array (PLA), microcode, instruction set, emulated or virtual machine processor, or any similar “Turing-complete” device, combination of devices, or logic elements (hardware or software) that permit the execution of instructions.

Note also that in certain embodiments, some of the components may be omitted or consolidated. In a general sense, the arrangements depicted in the figures should be understood as logical divisions, whereas a physical architecture may include various permutations, combinations, and/or hybrids of these elements. It is imperative to note that countless possible design configurations can be used to achieve the operational objectives outlined herein. Accordingly, the associated infrastructure has a myriad of substitute arrangements, design choices, device possibilities, hardware configurations, software implementations, and equipment options.

In a general sense, any suitably-configured processor can execute instructions associated with data or microcode to achieve the operations detailed herein. Any processor disclosed herein could transform an element or an article (for example, data) from one state or thing to another state or thing. In another example, some activities outlined herein may be implemented with fixed logic or programmable logic (for example, software and/or computer instructions executed by a processor) and the elements identified herein could be some type of a programmable processor, programmable digital logic (for example, a field-programmable gate array (FPGA), an erasable programmable read only memory (EPROM), an electrically erasable programmable read only memory (EEPROM)), an ASIC that includes digital logic, software, code, electronic instructions, flash memory, optical disks, CD-ROMs, DVD ROMs, magnetic or optical cards, other types of machine-readable mediums suitable for storing electronic instructions, or any suitable combination thereof.

In operation, a storage may store information in any suitable type of tangible, nontransitory storage medium (for example, random access memory (RAM), read only memory (ROM), field programmable gate array (FPGA), erasable programmable read only memory (EPROM), electrically erasable programmable ROM (EEPROM), or microcode), software, hardware (for example, processor instructions or microcode), or in any other suitable component, device, element, or object where appropriate and based on particular needs. Furthermore, the information being tracked, sent, received, or stored in a processor could be provided in any database, register, table, cache, queue, control list, or storage structure, based on particular needs and implementations, all of which could be referenced in any suitable timeframe. Any of the memory or storage elements disclosed herein should be construed as being encompassed within the broad terms ‘memory’ and ‘storage,’ as appropriate. A nontransitory storage medium herein is expressly intended to include any nontransitory special-purpose or programmable hardware configured to provide the disclosed operations, or to cause a processor to perform the disclosed operations. A nontransitory storage medium also expressly includes a processor having stored thereon hardware-coded instructions, and optionally microcode instructions or sequences encoded in hardware, firmware, or software.

Computer program logic implementing all or part of the functionality described herein is embodied in various forms, including, but in no way limited to, hardware description language, a source code form, a computer executable form, machine instructions or microcode, programmable hardware, and various intermediate forms (for example, forms generated by an HDL processor, assembler, compiler, linker, or locator). In an example, source code includes a series of computer program instructions implemented in various programming languages, such as an object code, an assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML for use with various operating systems or operating environments, or in hardware description languages such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form, or converted to an intermediate form such as byte code. Where appropriate, any of the foregoing may be used to build or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machines, or otherwise.

In one example, any number of electrical circuits of the FIGS. may be implemented on a board of an associated electronic device. The board can be a general circuit board that can hold various components of the internal electronic system of the electronic device and, further, provide connectors for other peripherals. More specifically, the board can provide the electrical connections by which the other components of the system can communicate electrically. Any suitable processor and memory can be suitably coupled to the board based on particular configuration needs, processing demands, and computing designs. Other components such as external storage, additional sensors, controllers for audio/video display, and peripheral devices may be attached to the board as plug-in cards, via cables, or integrated into the board itself. In another example, the electrical circuits of the FIGS. may be implemented as stand-alone modules (e.g., a device with associated components and circuitry configured to perform a specific application or function) or implemented as plug-in modules into application specific hardware of electronic devices.

Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electrical components. However, this has been done for purposes of clarity and example only. It should be appreciated that the system can be consolidated or reconfigured in any suitable manner. Along similar design alternatives, any of the illustrated components, modules, and elements of the FIGS. may be combined in various possible configurations, all of which are within the broad scope of this specification. In certain cases, it may be easier to describe one or more of the functionalities of a given set of flows by only referencing a limited number of electrical elements. It should be appreciated that the electrical circuits of the FIGS. and their teachings are readily scalable and can accommodate a large number of components, as well as more complicated/sophisticated arrangements and configurations. Accordingly, the examples provided should not limit the scope or inhibit the broad teachings of the electrical circuits as potentially applied to a myriad of other architectures.

Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. In order to assist the United States Patent and Trademark Office (USPTO) and, additionally, any readers of any patent issued on this application in interpreting the claims appended hereto, Applicant wishes to note that the Applicant: (a) does not intend any of the appended claims to invoke paragraph (f) of 35 U.S.C. section 112, as it exists on the date of the filing hereof, unless the words “means for” or “steps for” are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any way that is not otherwise expressly reflected in the appended claims.

EXAMPLE IMPLEMENTATIONS

There is disclosed in one example, an apparatus, comprising: a deterministic monitored device; an interconnect to communicatively couple the monitored device to a support circuit; a super queue to queue transactions between the monitored device and the support circuit, the super queue comprising an operational segment and a shadow segment; a debug data structure; and a system management agent to monitor transactions in the operational segment, log corresponding transaction identifiers in the shadow segment, and write debug data to the debug data structure, wherein the debug data are at least partly based on the corresponding transaction identifiers.

There is further disclosed an example of an apparatus, wherein the support circuit is non-deterministic.

There is further disclosed an example of an apparatus, wherein the operational segment and the shadow segment are equal in size.

There is further disclosed an example of an apparatus, wherein the system management agent is to identify a transaction in the operational segment as an outbound transaction, and to not log the outbound transaction in the debug data.

There is further disclosed an example of an apparatus, wherein the system management agent is to log a corresponding inbound operation to the shadow segment.

There is further disclosed an example of an apparatus, wherein the outbound transaction is a read, the corresponding inbound transaction is an inbound write, and the system agent is to log a time stamp for the inbound write, and a result code in the shadow segment.

There is further disclosed an example of an apparatus, wherein the system management agent is further to log the inbound data.

There is further disclosed an example of an apparatus, wherein the system management agent is configured to opportunistically or periodically log outbound transactions for synchronization.

There is further disclosed an example of an apparatus, wherein the system management agent is further to log asynchronous transactions.

There is further disclosed an example of an apparatus, wherein the system management agent is further to log snoops for outbound transactions.

There is further disclosed an example of an apparatus, wherein the monitored device is a processor core.

There is further disclosed an example of an apparatus, wherein the support circuit is an uncore circuit.

There is further disclosed an example of an apparatus, wherein the support circuit is an intellectual property (IP) block.

There is further disclosed an example of an apparatus, wherein the monitored device is a graphics processor.
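By way of nonlimiting illustration only, the apparatus examples above may be modeled in software as follows. This is a minimal sketch under stated assumptions, not a description of any actual hardware implementation: the class names `SuperQueue`, `ShadowEntry`, and `SystemManagementAgent`, the use of dictionaries keyed by transaction identifier, and the `drain` method are all hypothetical. The sketch assumes equally sized segments, keeps outbound transactions out of the debug data, and logs the corresponding inbound write together with a time stamp, a result code, and optionally the inbound data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ShadowEntry:
    """Hypothetical record logged for one inbound transaction."""
    txn_id: int
    timestamp: int
    result_code: int
    data: Optional[bytes] = None


class SuperQueue:
    """Toy super queue with equally sized operational and shadow segments."""

    def __init__(self, entries_per_segment: int) -> None:
        self.entries_per_segment = entries_per_segment
        self.operational: Dict[int, str] = {}      # txn_id -> outbound kind
        self.shadow: Dict[int, ShadowEntry] = {}   # txn_id -> inbound log

    def issue_outbound(self, txn_id: int, kind: str) -> None:
        # Outbound transactions occupy the operational segment only.
        if len(self.operational) >= self.entries_per_segment:
            raise OverflowError("operational segment is full")
        self.operational[txn_id] = kind

    def complete_inbound(self, txn_id: int, timestamp: int,
                         result_code: int,
                         data: Optional[bytes] = None) -> None:
        # The inbound write is matched to its outbound read by identifier
        # and logged in the shadow segment.
        if txn_id not in self.operational:
            raise KeyError(f"no outstanding outbound transaction {txn_id}")
        if len(self.shadow) >= self.entries_per_segment:
            raise OverflowError("shadow segment is full")
        del self.operational[txn_id]
        self.shadow[txn_id] = ShadowEntry(txn_id, timestamp, result_code, data)


class SystemManagementAgent:
    """Toy agent that moves shadow entries into a debug data structure."""

    def __init__(self, queue: SuperQueue) -> None:
        self.queue = queue
        self.debug_log: List[ShadowEntry] = []

    def drain(self) -> int:
        # Write logged inbound transactions out in time-stamp order,
        # then free the shadow segment for reuse.
        entries = sorted(self.queue.shadow.values(), key=lambda e: e.timestamp)
        self.debug_log.extend(entries)
        self.queue.shadow.clear()
        return len(entries)
```

In this sketch the debug log never contains an outbound transaction; because the monitored device is assumed deterministic, only what arrives from the support circuit needs to be captured.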

There is also disclosed an example of a computer apparatus, comprising: a deterministic core; a non-deterministic support circuit; an interconnect to communicatively couple the core to the support circuit; a super queue to queue transactions between the core and the support circuit, the super queue comprising an operational segment and a shadow segment; a debug data structure; and a system management agent to monitor transactions in the operational segment, log corresponding transaction identifiers in the shadow segment, and write debug data to the debug data structure, wherein the debug data are at least partly based on the corresponding transaction identifiers.

There is further disclosed an example of a computer apparatus, wherein the operational segment and the shadow segment are equal in size.

There is further disclosed an example of a computer apparatus, wherein the system management agent is to identify a transaction in the operational segment as an outbound transaction, and to not log the outbound transaction in the debug data.

There is further disclosed an example of a computer apparatus, wherein the system management agent is to log a corresponding inbound operation to the shadow segment.

There is further disclosed an example of a computer apparatus, wherein the outbound transaction is a read, the corresponding inbound transaction is an inbound write, and the system agent is to log a time stamp for the inbound write, and a result code in the shadow segment.

There is further disclosed an example of a computer apparatus, wherein the system management agent is further to log the inbound data.

There is further disclosed an example of a computer apparatus, wherein the system management agent is configured to opportunistically or periodically log outbound transactions for synchronization.

There is further disclosed an example of a computer apparatus, wherein the system management agent is further to log asynchronous transactions.

There is further disclosed an example of a computer apparatus, wherein the system management agent is further to log snoops for outbound transactions.

There is further disclosed an example of a computer apparatus, wherein the support circuit is an intellectual property (IP) block.

There is further disclosed an example of a system-on-a-chip, comprising the apparatus of a number of the above examples.

There is further disclosed an example of a debug system, comprising the computing apparatus of a number of the above examples, and an emulator to emulate a portion of the computing apparatus according to the debug instructions.

There is further disclosed an example of a debug system of the above example, wherein the portion of the computing apparatus is the core.
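By way of nonlimiting illustration only, the following sketch suggests why logged inbound transactions may be sufficient for an emulator to reproduce the behavior of a deterministic core. The toy core, the checksum it computes, and the helper names `run_core`, `record`, and `replay` are all assumptions made for the example and are not taken from this disclosure.

```python
from typing import Callable, Iterable, List, Tuple

ReadFn = Callable[[int], int]
Log = List[Tuple[int, int]]


def run_core(read: ReadFn, addresses: Iterable[int]) -> int:
    """Deterministic toy core: folds every value it reads into a checksum."""
    acc = 0
    for addr in addresses:
        acc = (acc * 31 + read(addr)) & 0xFFFFFFFF
    return acc


def record(live_read: ReadFn, log: Log) -> ReadFn:
    """Wrap the live support circuit so each inbound value is logged."""
    def wrapped(addr: int) -> int:
        value = live_read(addr)
        log.append((addr, value))
        return value
    return wrapped


def replay(log: Log) -> ReadFn:
    """Stand in for the support circuit by returning logged values in order."""
    entries = iter(log)

    def wrapped(addr: int) -> int:
        logged_addr, value = next(entries)
        if logged_addr != addr:
            raise RuntimeError("replay diverged from the recorded run")
        return value
    return wrapped
```

Running `run_core` once against `record(...)` and again against `replay(log)` with the same address sequence yields the same result, because the only non-deterministic input, the data returned by the support circuit, is supplied from the log.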

There is also disclosed a method of providing debug information for a computing apparatus, comprising: communicatively coupling a deterministic monitored device to a support circuit via an interconnect; dividing a super queue into an operational segment and a shadow segment, the super queue to queue transactions between the monitored device and the support circuit; provisioning a debug data structure; and monitoring transactions in the operational segment; logging corresponding transaction identifiers in the shadow segment; and writing debug data to the debug data structure, wherein the debug data are at least partly based on the corresponding transaction identifiers.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the support circuit is non-deterministic.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the operational segment and the shadow segment are equal in size.

There is further disclosed a method of providing debug information for a computing apparatus, further comprising identifying a transaction in the operational segment as an outbound transaction, and not logging the outbound transaction in the debug data.

There is further disclosed a method of providing debug information for a computing apparatus, further comprising logging a corresponding inbound operation to the shadow segment.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the outbound transaction is a read, the corresponding inbound transaction is an inbound write, further comprising logging a time stamp for the inbound write, and a result code in the shadow segment.

There is further disclosed a method of providing debug information for a computing apparatus, further comprising opportunistically or periodically logging outbound transactions for synchronization.

There is further disclosed a method of providing debug information for a computing apparatus, further comprising logging asynchronous transactions.

There is further disclosed a method of providing debug information for a computing apparatus, further comprising logging snoops for outbound transactions.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the monitored device is a processor core.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the support circuit is an uncore circuit.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the support circuit is an intellectual property (IP) block.

There is further disclosed a method of providing debug information for a computing apparatus, wherein the monitored device is a graphics processor.

There is further disclosed an example of an apparatus comprising means for performing the method of a number of the above examples.

There is further disclosed an example of an apparatus, wherein the apparatus comprises a system on a chip.

There is also disclosed a debug system comprising an apparatus, and an emulator to emulate a portion of the computing apparatus according to the debug instructions.

There is further disclosed a debug system, wherein the portion of the computing apparatus is the core.

1.-24. (canceled)
25. An apparatus, comprising: a deterministic device; support circuitry coupled to the deterministic device via an interface; wherein the deterministic device comprises a super queue to log outbound transactions from the deterministic device to the support circuitry and log inbound transactions from the support circuitry to the deterministic core corresponding to the outbound transactions; and wherein the support circuitry comprises a system management agent to write debug data to a debug data structure based on the logged inbound transactions in the super queue.
26. The apparatus of claim 25, wherein the support circuitry is non-deterministic.
27. The apparatus of claim 25, wherein the super queue comprises an operational segment to log the outbound transactions, and a shadow segment to log the inbound transactions.
28. The apparatus of claim 27, wherein the operational segment and shadow segment are equal in size.
29. The apparatus of claim 27, wherein the super queue is to store read operations as outbound transactions in the operational segment, and inbound write operations corresponding to the read operations in the shadow segment.
30. The apparatus of claim 29, wherein the super queue is to log, in the shadow segment, a time stamp for the inbound write and a result code.
31. The apparatus of claim 30, wherein the super queue is further to log, in the shadow segment, the inbound data associated with the inbound write.
32. The apparatus of claim 25, further comprising circuitry coupled to the interface to capture one or more of asynchronous transactions and snoops.
33. The apparatus of claim 25, wherein the deterministic device is a processor core.
34. The apparatus of claim 33, wherein the support circuitry is an uncore circuit.
35. The apparatus of claim 33, wherein the support circuitry is an intellectual property (IP) block.
36. The apparatus of claim 25, wherein the deterministic device is a graphics processor.
37. The apparatus of claim 25, wherein the apparatus is a system-on-chip (SoC).
38. A system, comprising: a deterministic core; non-deterministic support circuitry; an interface to communicatively couple the core to the support circuitry; and an emulation store; wherein the deterministic device comprises a super queue to log outbound transactions from the deterministic device to the non-deterministic support circuitry and log inbound transactions from the non-deterministic support circuitry to the deterministic core corresponding to the outbound transactions; and wherein the non-deterministic support circuitry comprises a system management agent to write debug data to a debug data structure based on the logged inbound transactions in the super queue.
39. The system of claim 38, wherein the super queue comprises an operational segment to log the outbound transactions, and a shadow segment to log the inbound transactions.
40. The system of claim 39, wherein the operational segment and shadow segment are equal in size.
41. The system of claim 39, wherein the super queue is to store read operations as outbound transactions in the operational segment, and inbound write operations corresponding to the read operations in the shadow segment.
42. The system of claim 41, wherein the super queue is to log, in the shadow segment, a time stamp for the inbound write and a result code.
43. The system of claim 42, wherein the super queue is further to log, in the shadow segment, the inbound data associated with the inbound write.
44. The system of claim 38, further comprising circuitry coupled to the interface to capture one or more of asynchronous transactions and snoops.